Abstract

Background: The enumeration of chemical graphs (molecular graphs) satisfying given constraints is one of the 
fundamental problems in chemoinformatics and bioinformatics because it leads to a variety of useful applications 
including structure determination and development of novel chemical compounds. 

Results: We consider the problem of enumerating chemical graphs with monocyclic structure (a graph structure 
that contains exactly one cycle) from a given set of feature vectors, where a feature vector represents the frequency of 
the prescribed paths in a chemical compound to be constructed and the set is specified by a pair of upper and lower 
feature vectors. To enumerate all tree-like (acyclic) chemical graphs from a given set of feature vectors, Shimizu et al. 
and Suzuki et al. proposed efficient branch-and-bound algorithms based on a fast tree enumeration algorithm. In this 
study, we devise a novel method for extending these algorithms to enumeration of chemical graphs with monocyclic 
structure by designing a fast algorithm for testing uniqueness. The results of computational experiments reveal that the
computational efficiency of the new algorithm is comparable to that of the algorithms for enumerating tree-like chemical compounds.

Conclusions: We succeed in expanding the class of chemical graphs that can be enumerated efficiently.
Keywords: Chemical graphs, Enumeration, Monocyclic structure, Feature vector 



Introduction 

The enumeration of chemical structures satisfying given 
constraints is an important topic in chemoinformat- 
ics [1-3]. Applications of the enumeration of chemical 
structures include structure determination using mass- 
spectrum and/or NMR-spectrum [4,5], virtual explo- 
ration of the chemical universe [6,7], reconstruction of 
molecular structures from their signatures [8,9], and 
classification of chemical compounds [10]. The enu- 
meration problem is also important for development of 
novel chemical compounds because virtual exploration 
of chemical universe and reconstruction of molecular 
structures from their signatures are considered to be 
important elementary technologies. The enumeration of 
chemical structures has a long history. Cayley [11] con- 
sidered the enumeration of structural isomers of alka- 
nes in the 19th century. The seminal work of Polya on 
counting the number of isomers using group theory is also
famous [12]. 

In this paper, we consider the problem of enumerating 
chemical structures having monocyclic graph structures 
satisfying a given constraint, where a monocyclic graph 
is an undirected connected graph containing exactly one 
cycle (a graph is connected if there exists a path connect- 
ing every pair of vertices), and a constraint is given in the 
form of a set of feature vectors (i.e., a set of descriptors). 
We assume that each feature vector specifies the num- 
ber of occurrences of each labeled path of length up to 
a given constant K, where a labeled path is an alternat- 
ing sequence of atom names and bond types (see Figure 1 
for an example of a feature vector). We also assume that a 
set of feature vectors is given by specifying the minimum 
and maximum numbers of occurrences of each labeled 
path. We develop an efficient algorithm for this enumera- 
tion problem by extending existing algorithms [13,14] for 
enumerating tree-like chemical structures (i.e., chemical 
structures without cycles). In this extension, some novel
concepts are introduced and rigorous mathematical analysis
is performed in order to guarantee the correctness of
the algorithm.
Figure 1 Example of a chemical structure and its feature vector.
In this example, a feature vector consists of the number of
occurrences of each atom type and each bond type (e.g., C2C denotes
the double bond between two carbon atoms). Note that the entry of
C1C is 8 because each bond is counted for both directions. Since
there may exist multiple structures with the same feature vector,
enumeration of such structures is required.



In order to verify the computational efficiency of our 
proposed algorithm, we perform computational experi- 
ments using a set of some chemical compounds from the 
KEGG LIGAND database [15]. The results suggest that 
the proposed algorithm enumerates chemical structures
having monocyclic graph structures nearly as efficiently as
the existing algorithms enumerate tree-like chemical graphs.

The rest of this paper is organized as follows. First, we 
review some mathematical definitions and give a formal 
definition of the enumeration problem for chemical struc- 
tures with monocyclic graph structures. Next, we review 
background and related work. Then, we present the 
algorithm and the results of computational experiments. 
Finally, we conclude with future work. Mathematical 
proofs, pseudocodes for the algorithm, and some details 
on computational experiments are given in Additional 
file 1. 

Preliminaries and problem formulation 

This section reviews some basic definitions on graphs 
and formalizes the problem to be addressed in this work. 
Before providing formal descriptions, we briefly explain 
the problem definition using an example in Figure 2. 
Figure 2 Example of a $\Sigma$-colored multi-graph $G$ and its feature
vector $f_1(G)$. $G$ represents a hydrogen-suppressed chemical graph,
where $\deg(v; G) < \mathrm{val}(c(v))$ holds for some vertex $v \in V(G)$. Since
each path is counted for both directions, the entry for C2C is not one,
but two.



Our basic problem is to enumerate all chemical struc- 
tures each of which is consistent with a given feature 
vector and the valence condition. Each coordinate of a 
feature represents the number of occurrences of vertex- 
and edge-labeled paths. In order to keep the size of a 
feature vector moderate, we restrict the length of paths 
to be no greater than a constant K. In the example of 
Figure 2, we consider paths of lengths 0 and 1, where a 
path of length 0 corresponds to a single atom and a path 
of length 1 corresponds to a bond including its endpoint 
atoms. For example, the columns O, N, and C of feature
vector $f_1(G)$ mean that each target structure must contain
exactly one oxygen, two nitrogen, and three carbon atoms,
respectively. The columns N1O, N1C, and C2C mean that
each target structure must contain exactly one single bond
connecting N and O, two single bonds connecting C and
N, and one double bond connecting C and C. It should
be noted that one single bond connecting N and O is
counted by both O1N and N1O. Then, the chemical structure
$G$ is consistent with $f_1(G)$. However, another chemical
structure may be consistent with a given feature vector. 
For example, the feature vector remains the same even if 
the double bond (along with the branching carbon atom) 
is moved into the backbone chain. Therefore, it is desir- 
able to enumerate all chemical structures consistent with 
a given feature vector and the valence condition (speci- 
fied by val(. . .)). On the other hand, there may not exist 
any consistent chemical structure if K is large; thus it may 
not be appropriate to uniquely specify a feature vector. 
Therefore, we assume in our target problem that upper 
and lower bounds of the number of occurrences of each 
labeled path are given as shown in Figure 3. 

Figure 3 Example of an input of EULF and part of its output. The
input includes upper and lower feature vectors, and the output
includes multi-trees $G_1$ and $G_2$ and a 1-augmented tree $G_3$.

A multi-graph is a graph that can have multiple edges
between the same pair of vertices, where vertices correspond
to atoms and multi-edges correspond to double and
triple bonds in chemical compounds. We call a connected
multi-graph a $k$-augmented tree if the number of adjacent
vertex pairs (i.e., vertex pairs connected by edges) minus
the number of vertices is $k - 1$ (hence a multi-tree is a
0-augmented tree). That is, a $k$-augmented tree is a graph
obtained by adding edges to $k$ different pairs of nonadjacent
vertices in a multi-tree (see Figure 4). The problem
considered in this paper is to enumerate all 1-augmented
trees satisfying specified upper and lower bound conditions
on feature vectors. In the following, we provide the
mathematical definition of the problem. We assume that
readers have some familiarity with basic concepts in graph
theory. For those who are not familiar with graph theory,
we suggest referring to an appropriate textbook (e.g., [16]).
Readers not interested in mathematical details can skip
this part.

Figure 4 Examples of $k$-augmented trees. A $k$-augmented tree is
obtained by adding $k$ edges to a multi-tree, where a multi-tree is a
tree with multiple bonds (precisely, multiple edges between the
same pair(s) of vertices).

A graph is defined to be an ordered pair $(V, E)$ of a finite
set $V$ of vertices and a finite set $E$ of edges, where an edge
is an unordered pair of distinct vertices (thus no self-loop
exists) and an edge with two end-vertices $u$ and $v$ is
denoted by $uv$. A graph is called a multi-graph when $E$
is not necessarily composed of distinct pairs of vertices
(thus multiple edges are allowed in a multi-graph and $E$
is no longer a set, but a multi-set), and is called a simple
graph if no multiple edges are allowed. The multiplicity
(the number of multiple edges) between two vertices $u$
and $v$ is denoted by $m(u, v)$. An edge in a multi-graph $G$
is called simple if its multiplicity in $G$ is one. We denote
the vertex set and edge set of a graph $G$ by $V(G)$ and
$E(G)$, respectively. For a vertex $v$ in a multi-graph $G$, let
$\deg(v; G)$ denote the number of edges incident to vertex
$v$ (i.e., the degree of $v$). In this paper, a cycle is a closed
path with a length of at least three (two edges with the
same endvertices are not treated as a cycle), and a connected
multi-graph (resp., simple graph) with no cycle is
a multi-tree (resp., simple tree). A $k$-augmented tree is a
connected multi-graph such that the number of adjacent
vertex pairs (i.e., vertex pairs connected by edges) minus
the number of vertices is $k - 1$. For two vertices $u$ and
$v$ in a multi-graph $G$, let $G + uv$ denote the multi-graph
obtained by adding a new edge $uv$ to $G$; when $uv \in E(G)$,
let $G - uv$ denote the multi-graph obtained by removing
$uv$ from $G$. Let $\mathbb{Z}_+$ denote the set of nonnegative integers,
and let $\Sigma$ be a set of colors, which correspond to
chemical elements such as H, O and C. Let each color
$c \in \Sigma$ be associated with a valence $\mathrm{val}(c) \in \mathbb{Z}_+$. A multi-graph
$G$ is said to be $\Sigma$-colored if each vertex $v$ has
a color $c(v) \in \Sigma$. Chemical compounds can be viewed
as $\Sigma$-colored, self-loopless connected multi-graphs, where
vertices and colors represent atoms and elements,
respectively.
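The counting rule in the definition of a k-augmented tree can be checked directly. The following is a minimal sketch, assuming a multi-graph is stored as a dictionary from unordered vertex pairs to multiplicities; the function name `augmentation_index` and the vertex names are illustrative and do not come from the paper.

```python
from collections import deque

def augmentation_index(vertices, mult):
    """Return k such that a connected multi-graph is a k-augmented tree.

    vertices: iterable of vertex names (hypothetical representation).
    mult: dict mapping each adjacent pair (u, v), listed once, to m(u, v) >= 1.
    """
    vertices = list(vertices)
    adj = {v: set() for v in vertices}
    for u, v in mult:
        adj[u].add(v)
        adj[v].add(u)
    # Breadth-first search to confirm that the graph is connected.
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        x = queue.popleft()
        for y in adj[x] - seen:
            seen.add(y)
            queue.append(y)
    if len(seen) != len(vertices):
        raise ValueError("graph is not connected")
    # (#adjacent vertex pairs) - (#vertices) = k - 1; multiplicities are ignored.
    return len(mult) - len(vertices) + 1

# A chain O-N-C=C (a multi-tree) and a three-membered ring with one double bond.
chain = {("O", "N"): 1, ("N", "C1"): 1, ("C1", "C2"): 2}
ring = {("C1", "C2"): 2, ("C2", "C3"): 1, ("C1", "C3"): 1}
print(augmentation_index(["O", "N", "C1", "C2"], chain))
print(augmentation_index(["C1", "C2", "C3"], ring))
```

Because a double bond is one adjacent pair with multiplicity two, it does not raise k; only a new adjacent pair that closes a cycle of length at least three does.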

Let $d \in \mathbb{Z}_+$ be a prescribed integer, which corresponds
to the maximum multiplicity of chemical graphs,
and let $\Sigma^{k,d}$ denote the set of all alternating sequences
$(c_0, m_1, c_1, \ldots, m_k, c_k)$ consisting of colors $c_0, c_1, \ldots, c_k \in \Sigma$
and $m_1, m_2, \ldots, m_k \in \{1, 2, \ldots, d\}$. We denote the
union of $\Sigma^{0,d}, \Sigma^{1,d}, \ldots, \Sigma^{K,d}$ by $\Sigma^{\leq K,d}$. Let $\mathcal{F}_K(\Sigma, d)$
be the set of all mappings $g$ from $\Sigma^{\leq K,d}$ to $\mathbb{Z}_+$, i.e.,

$$\mathcal{F}_K(\Sigma, d) = \{g : \Sigma^{\leq K,d} \to \mathbb{Z}_+\}.$$

For a path $P = (v_0, m_1, v_1, \ldots, m_k, v_k)$ such that $V(P) = \{v_0, v_1, \ldots, v_k\}$,
$E(P) = \{v_0 v_1, v_1 v_2, \ldots, v_{k-1} v_k\}$, and $m_i = m(v_{i-1}, v_i)$
is the multiplicity of edge $v_{i-1} v_i$, the length of
$P$ is defined to be $k = |V(P)| - 1$, and the color sequence
$c(P)$ of $P$ is defined to be the sequence
$c(P) = (c(v_0), m_1, c(v_1), \ldots, m_k, c(v_k)) \in \Sigma^{k,d}$.

Given a multi-graph $G$ and a sequence $t \in \Sigma^{k,d}$ for
some $k$, let $\mathrm{occ}(t, G)$ denote the number of paths $P$ in $G$
such that $c(P) = t$. For an integer $K \in \mathbb{Z}_+$, the feature
vector $f_K(G)$ of level $K$ in $G$ is defined to be the $|\Sigma^{\leq K,d}|$-dimensional
vector $f_K(G)$ whose value at each entry $t \in \Sigma^{\leq K,d}$
is given by $f_K(G)[t] = \mathrm{occ}(t, G)$. In this paper, we
treat hydrogen-suppressed chemical graphs with carbon
C, nitrogen N or oxygen O, which are represented by $\Sigma$-colored
multi-graphs $G$ with color set $\Sigma = \{\mathrm{O}, \mathrm{N}, \mathrm{C}\}$.
Figure 2 illustrates an example of a $\Sigma$-colored multi-graph
$G$ that represents a hydrogen-suppressed chemical graph
and its feature vector $f_1(G)$.
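To make the definition of the feature vector concrete, the following sketch counts labeled paths of length at most K by a simple depth-first walk. The data layout (a `color` dictionary and a `mult` dictionary keyed by vertex pairs), the function name `feature_vector`, and the four-atom example are assumptions made for illustration; the example is not the graph of Figure 2.

```python
from collections import Counter

def feature_vector(color, mult, K):
    """Count occurrences of every labeled path of length <= K.

    color: dict vertex -> element symbol (hypothetical representation).
    mult: dict (u, v) -> multiplicity, each adjacent pair listed once.
    Each path is traversed from both of its ends, so it is counted
    for both directions, as in the text.
    """
    adj = {v: {} for v in color}
    for (u, v), m in mult.items():
        adj[u][v] = m
        adj[v][u] = m
    f = Counter()

    def extend(path, seq):
        f[seq] += 1                      # record the path ending here
        if len(path) - 1 == K:           # maximum length reached
            return
        for w, m in adj[path[-1]].items():
            if w not in path:            # paths do not repeat vertices
                extend(path + [w], seq + (m, color[w]))

    for v in color:
        extend([v], (color[v],))
    return f

# Hydrogen-suppressed chain O-N-C=C (illustrative).
color = {0: "O", 1: "N", 2: "C", 3: "C"}
mult = {(0, 1): 1, (1, 2): 1, (2, 3): 2}
f1 = feature_vector(color, mult, K=1)
for t, count in sorted(f1.items(), key=str):
    print(t, count)
```

In this example the entry for the symmetric sequence (C, 2, C) is two, while the single N-O bond contributes one each to (N, 1, O) and (O, 1, N).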

Note that in a hydrogen-suppressed chemical graph $G$,
$\deg(v; G) < \mathrm{val}(c(v))$ may hold for some vertex $v \in V(G)$.
Let us define the residue degree res(v) of a vertex v to 
be val(c(v)) — deg(v; G). In a multi-graph G, we interpret 



Suzuki et al. Journal of Cheminformatics 2014, 6:31 
http://www.jcheminf.com/content/6/1/31






res(v) of a vertex v as the number of hydrogen atoms 
attached to the vertex v (in our proposed procedure, we 
also interpret res(v) as the number of new edges/bonds 
that can be attached to v when G is being constructed by 
adding more edges). 
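As a small illustration of the residue degree, the sketch below computes res(v) = val(c(v)) − deg(v; G) for every vertex. The valence table (O: 2, N: 3, C: 4), the data layout, and the function name `residue_degrees` are assumptions for this example only.

```python
# Assumed valences for the color set {O, N, C}.
VAL = {"O": 2, "N": 3, "C": 4}

def residue_degrees(color, mult):
    """Return res(v) for every vertex of a hydrogen-suppressed multi-graph.

    color: dict vertex -> element symbol.
    mult: dict (u, v) -> multiplicity, each adjacent pair listed once.
    """
    deg = {v: 0 for v in color}
    for (u, v), m in mult.items():
        deg[u] += m          # deg counts edges with multiplicity
        deg[v] += m
    res = {}
    for v, c in color.items():
        r = VAL[c] - deg[v]
        if r < 0:
            raise ValueError(f"valence of vertex {v} exceeded")
        res[v] = r
    return res

# Chain O-N-C=C: the residues give the implicit hydrogens of HO-NH-CH=CH2.
color = {0: "O", 1: "N", 2: "C", 3: "C"}
mult = {(0, 1): 1, (1, 2): 1, (2, 3): 2}
res = residue_degrees(color, mult)
print(res)
```

During construction, a vertex with res(v) = 0 cannot receive a further bond, which is the second reading of res(v) mentioned above.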

For a vector $g \in \mathcal{F}_K(\Sigma, d)$ of level $K \geq 1$, a multi-graph
$G$ with $f_K(G) = g$ is a multi-graph such that the occurrence
of each path $t = (c_0, m_1, c_1, \ldots, m_p, c_p)$ in $G$ with
length of at most $K$ is completely specified by $g[t]$; in particular,
$V(G) = \{t \mid g[t] \geq 1, t \in \Sigma^{0,d} = \Sigma\}$ (i.e., $G$ has
exactly $g[t]$ vertices of color $t$), and
$E(G) = \{t \mid g[t] \geq 1, t = (c, m, c') \in \Sigma^{1,d}\}$ (i.e., $G$ has exactly $g[(c, m, c')]$ edges of
multiplicity $m$ that join a vertex of color $c$ and a vertex of
color $c'$).

For two vectors $g_L, g_U \in \mathcal{F}_K(\Sigma, d)$ and an integer $k \geq 0$,
let $\mathcal{G}_k(g_L, g_U)$ denote the set of all $\Sigma$-colored $k$-augmented
trees $G$ such that $g_L \leq f_K(G) \leq g_U$ (i.e., $f_K(G) = g'$ for
some $g'$ with $g_L \leq g' \leq g_U$) and $\deg(v; G) \leq \mathrm{val}(c(v))$,
$v \in V(G)$.

Our problem is to enumerate all $k$-augmented trees $G$
on a given set of atoms each of which is consistent with
one of the feature vectors between the lower and upper
vectors $g_U, g_L \in \mathcal{F}_K(\Sigma, d)$, such that $g_L \leq g_U$ (where
$g_L[t] = g_U[t]$ for all $t \in \Sigma^{0,d}$ since the vertex set is fixed
for all $G$).

In what follows, we fix a color set $\Sigma$ and an upper bound
$d$ on multiplicity. We define the problem of enumerating
$k$-augmented trees as follows.

Enumerating chemical graphs with given upper and
lower path frequency (EULF) Given a maximum path
length $K \in \mathbb{Z}_+$ and feature vectors $g_U, g_L \in \mathcal{F}_K(\Sigma, d)$
such that $g_L[t] = g_U[t]$ for all $t \in \Sigma^{0,d}$, enumerate all
multi-graphs $G \in \mathcal{G}_k(g_L, g_U)$.

Figure 3 illustrates an example of an input of EULF
with upper and lower feature vectors $g_L$ and $g_U$ and part
of its output, multi-trees $G_1, G_2 \in \mathcal{G}_0(g_L, g_U)$ and a
1-augmented tree $G_3 \in \mathcal{G}_1(g_L, g_U)$.
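The membership condition on feature vectors in EULF is a componentwise comparison. A minimal sketch follows, assuming feature vectors are stored as dictionaries in which absent entries are zero; the function name `satisfies_bounds` and the example vectors are hypothetical, and the valence condition would have to be checked separately.

```python
def satisfies_bounds(f, g_lower, g_upper):
    """Check g_lower <= f <= g_upper entry by entry.

    All three arguments are dicts from labeled paths (tuples) to counts;
    an entry missing from a dict is treated as 0 (an assumption of this sketch).
    """
    keys = set(f) | set(g_lower) | set(g_upper)
    return all(
        g_lower.get(t, 0) <= f.get(t, 0) <= g_upper.get(t, 0) for t in keys
    )

# Level-1 feature vector of the chain O-N-C=C (illustrative).
f = {("O",): 1, ("N",): 1, ("C",): 2,
     ("O", 1, "N"): 1, ("N", 1, "O"): 1,
     ("N", 1, "C"): 1, ("C", 1, "N"): 1,
     ("C", 2, "C"): 2}

# Bounds agree on the atom counts, as EULF requires, and relax the bond counts.
g_lower = {("O",): 1, ("N",): 1, ("C",): 2}
g_upper = dict(f)
g_upper[("C", 1, "C")] = 2   # a C-C single bond would also be allowed

print(satisfies_bounds(f, g_lower, g_upper))

# Tightening the upper bound on C=C below the actual count excludes the graph.
too_tight = dict(g_upper)
too_tight[("C", 2, "C")] = 0
print(satisfies_bounds(f, g_lower, too_tight))
```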

For k = 0, we have developed an efficient algorithm for 
EULF [13,14]. The purpose of this work is to describe an 
algorithm for EULF with k = 1. We assume that the maximum
valence is 4 and mainly enumerate 1-augmented
trees such that the cycle contains an edge of multiplicity
one (a single bond), since otherwise a 1-augmented
tree is a single cycle consisting of edges of multiplicity 
two, which can be separately handled as a special rare 
case. 

Background 

As mentioned in the Introduction, enumeration of chemical
structures has a long history and many studies have
been done. In the field of machine learning, a simi- 
lar problem, which is called the preimage problem, has 
been studied [17,18]. In this problem, a desired object 



is computed as a feature vector in a feature space, 
and then the feature vector is mapped back to the 
input space, where this mapped back object is called 
a preimage. The definition of the feature vectors based 
on the frequency of labeled paths [19,20] or small 
fragments [10,21] has been widely used. Akutsu and 
Fukagawa [22] formulated the graph preimage problem 
as the problem of inferring graphs from the frequency 
of paths of labeled vertices and proved that the problem 
is computationally intractable (NP-hard) even for pla- 
nar graphs with bounded degrees [22]. Nagamochi [23] 
proved that a graph determined by the frequency of paths 
with length one can be found in polynomial time if any 
exists. 

The preimage problem has also been studied in the field 
of chemoinformatics as a part of inverse QSAR/QSPR 
(quantitative structure-activity relationship/quantitative 
structure-property relationship) studies. Indeed, the 
problem is essentially the same as reconstruction and/or 
enumeration of molecules from their descriptors in 
inverse QSAR/QSPR [8,9,24,25], where the descriptors 
correspond to feature vectors in the preimage problem. 
Wong and Burkowski developed a practical preimage 
based method and demonstrated that it actually generated 
the structure of a new drug candidate [26]. For enumer- 
ation of molecules from descriptors, useful tools such as 
MOLGEN have been developed [27]. However, they are 
not very efficient if large structures are to be enumer- 
ated because many of them treat general graph structures 
(under the valence constraint). 

It might be possible to develop significantly faster algo- 
rithms for the preimage problem if we restrict the class 
of target chemical structures and employ recent tech- 
niques for enumeration of graph structures. Fujiwara 
et al. [28] studied enumeration of tree-like chemical 
graphs that satisfy a given feature vector which speci- 
fies frequency of paths of up to a prescribed length K 
in a chemical compound to be constructed. They pro- 
posed a branch-and-bound algorithm that consists of 
a branching procedure based on the tree enumeration 
algorithm by Nakano and Uno [29,30] and bounding 
operations designed by properties on path frequency 
and atom-atom bonds. They showed by means of com- 
putational experiments on enumeration of alkane iso- 
mers that their algorithm works at least as efficiently 
as the fastest algorithm while using much less memory 
space. 

To reduce the size of the search space, Ishida et al. [31] 
have introduced a new bounding operation, called the 
detachment-cut, based on the result of Nagamochi [23]. 
In this problem formulation, it is required that the path 
frequency of a chemical structure is exactly the same as 
the specified one. However, there does not exist such 
a structure in many cases because a mapping between 
chemical structures and feature vectors is not surjective
and thus there are many vectors in a feature space that 
do not have preimages. To seek solutions effectively in 
a relaxed constraint, Shimizu et al. [13] recently intro- 
duced a problem of enumerating tree-like hydrogen- 
suppressed chemical graphs that satisfy one of a given set 
of feature vectors which is specified by a pair of upper 
and lower feature vectors. They proposed a branch-and- 
bound algorithm for the problem, called 1-Phase algo- 
rithm, and afterward Suzuki et al. [14] proposed a more 
efficient and effective algorithm, called 2-Phase algorithm. 
Implementations of these algorithms [13,14] for enumer- 
ating tree-like hydrogen-suppressed/hydrogen-retained 
chemical graphs with given upper and lower bounds 
on path frequencies are available on a web server 
(http://sunflower.kuicr.kyoto-u.ac.jp/tools/enumol2/).

As shown by Nakano and Uno [29,30], the class of 
trees admits a nice scheme for computer representa- 
tion of their structures (called "left-heavy trees") which 
enables us to generate trees significantly faster (in con- 
stant time per tree) without executing any explicit test on 
the uniqueness of structure representations of temporar- 
ily generated labeled graphs. Development of algorithms 
for enumerating chemical graphs with a "non-tree struc- 
ture" is thereby a challenging task if we still wish to 
attain high computational efficiency as we have achieved 
for enumeration of tree-like chemical graphs, because no 
such effective representation scheme is known for general 
graphs. It should be noted that although polynomial-time 
algorithms have been developed for equivalence test and 
unique representation form problems for bounded degree 
graphs [32,33] and chemical compounds [34], they are not 
directly applicable to efficient enumeration of chemical 
graphs. 

In the NCI database (http://cactus.nci.nih.gov/ncidb2. 
2/), the ratio of the number of chemical compounds 
with k-augmented tree structures to that of all registered
chemical compounds is approximately 9%, 22%,
28%, 20%, and 11% for k = 0,1,2,3, and 4, respectively. 
This implies that we have been able to treat only 9% of 
all of chemical compounds with high computational effi- 
ciency. As the first step toward efficient enumeration of 
non-tree chemical graphs, we consider the problem of 
hydrogen-suppressed chemical graphs with 1-augmented 
tree (monocyclic) structure. If we can solve this problem, 
we can treat 31% (= 9% + 22%) of chemical compounds. 
Although no effective representation scheme is known 
even to 1-augmented trees, we can create a tree by remov- 
ing one edge in the unique cycle in a 1-augmented tree 
(two multiple edges with the same endvertices are not called
a cycle in this paper). Additionally, the 2-Phase algorithm [14],
which enumerates tree-like hydrogen-suppressed chemical
graphs, can be used without any major modification
to enumerate such trees $T = G - e$ with one edge



deficit from 1-augmented trees G to be constructed. Thus 
the main task is to efficiently test the uniqueness of 
generated labeled 1-augmented trees. To design such a 
procedure, we use a well-reflected definition of a parent
0-augmented tree $T = G - e$ of a 1-augmented tree $G$.
As a result, we can combine the new procedure with 2- 
Phase algorithm to obtain an algorithm for enumerating 
hydrogen-suppressed chemical graphs with 1-augmented 
tree structure from upper and lower bounds on feature 
vectors. 

Method 

Our proposed algorithm is based on existing algorithms 
to enumerate colored trees [29,30] and colored multi- 
trees [13,14,28,31]. The basic strategy of our algorithm 
is to generate a multi-tree first and then extend it to a
1-augmented tree by adding an edge. In enumeration algorithms,
it is important not to miss any possible structures
and not to duplicate identical structures. In order to effi- 
ciently cope with these conditions, the concept of the 
family tree has been widely employed in various enumer- 
ation algorithms. To define a family tree for graphs, we 
need to define a parent-child relationship between graph 
structures so that a parent structure is uniquely deter- 
mined from a child structure, where each child structure is 
obtained by adding a vertex or an edge to its parent struc- 
ture. Because extension of a multi-tree to a 1-augmented 
tree is the core part of our proposed algorithm, we need 
to provide a proper definition of the parent-child rela- 
tionship between a multi-tree and a 1-augmented tree. 
As will be shown later, there may exist multiple pos- 
sible ways of having a parent structure. How to define 
the unique parent of a given 1-augmented tree is one 
of the novel points of our proposed algorithm. Another 
important issue on generating 1-augmented trees is not 
to generate identical 1-augmented trees from the same 
multi-tree. As will be shown later, there is a case in which 
additions of different edges result in identical structures. 
How to efficiently prevent this kind of duplicate genera- 
tion of identical structures is the other novel point of our 
proposed algorithm. In the following, we give a detailed 
description of the algorithm including these novel points. 
Again, readers not interested in mathematical details can 
skip this part. 

Overview of a new algorithm for 1-augmented trees

Let $\mathcal{G}'_0$ be the set of 0-augmented trees (multi-trees)
$T = G - e$ obtained from each 1-augmented tree $G \in \mathcal{G}_1(g_L, g_U)$
by removing a simple edge $e$ in the unique cycle
of $G$.

Then we have $\mathcal{G}'_0 \subseteq \mathcal{G}_0(g'_L, g_U)$ for a modified lower vector
$g'_L$ in a vector set $G'_L$. We construct such a vector set
$G'_L$ from $g_L$ as follows: For each $t \in \Sigma^{k,d}$ with $k \geq 2$, let
$g'_L[t] = 0$; and for each $t \in \Sigma^{1,d}$, let

$$g'_L[t] = \begin{cases} \max\{g_L[t] - 1, 0\} & \text{if } t \text{ is symmetric (i.e., it is identical with its reversal)} \\ \max\{g_L[t] - 2, 0\} & \text{otherwise.} \end{cases}$$



Thus our first task is to generate all multi-trees $T \in \mathcal{G}_0(g'_L, g_U)$
by using fast conventional algorithms such as the
2-Phase algorithm.

Our next task is to generate 1-augmented trees $G$ from
each multi-tree $T \in \mathcal{G}_0(g'_L, g_U)$ such that no 1-augmented
tree $G \in \mathcal{G}_1(g_L, g_U)$ will be duplicated during the entire
enumeration over all $T \in \mathcal{G}_0(g'_L, g_U)$. To attain this objective
without storing all generated 1-augmented trees for
a comparison with a newly generated 1-augmented tree,
we define a mapping $\pi : \mathcal{G}_1(g_L, g_U) \to \mathcal{G}_0(g'_L, g_U)$; the
multi-tree $T = \pi(G)$ for a 1-augmented tree $G$ is called
the parent of $G$. For a multi-tree $T$, a 1-augmented tree
$G$ with $\pi(G) = T$ is called a child of $T$ (possibly $T$
has more than one child), and is called a feasible child
of $T$ if $G \in \mathcal{G}_1(g_L, g_U)$. Note that any definition of
such a mapping will suffice as long as $\pi(G)$ is determined
only by the information of an "unlabeled graph" $G$ (i.e.,
topological structure), except for a possible difference in
computational efficiency to avoid duplication of solutions.

In the following, we show the 2-Phase algorithm,
present details of our definition of parents $\pi$ and design
an efficient procedure for generating all children $G$ from a
given multi-tree $T \in \mathcal{G}_0(g'_L, g_U)$.
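The role of the parent mapping can be illustrated on a toy case. The sketch below ignores colors, multiplicities, and feature vectors, and it replaces the signature-based parent rule of this paper with a brute-force canonical form over all vertex relabelings, which is feasible only for very small graphs. All function names are hypothetical. It generates the children of each of the three unlabeled trees on five vertices and accepts a candidate G = T + uv only if its parent is T.

```python
from itertools import combinations, permutations

def canon(n, edges):
    """Brute-force canonical form: smallest sorted edge list over all relabelings."""
    best = None
    for p in permutations(range(n)):
        code = tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
        if best is None or code < best:
            best = code
    return best

def cycle_edges(edges):
    """Edges on the unique cycle, found by repeatedly stripping leaf edges."""
    edges = set(edges)
    while True:
        deg = {}
        for u, v in edges:
            deg[u] = deg.get(u, 0) + 1
            deg[v] = deg.get(v, 0) + 1
        leaves = {e for e in edges if deg[e[0]] == 1 or deg[e[1]] == 1}
        if not leaves:
            return edges
        edges -= leaves

def parent(n, edges):
    """Toy parent rule: among the trees G - e (e on the cycle), take the largest code."""
    return max(canon(n, [f for f in edges if f != e]) for e in cycle_edges(edges))

def children(n, tree_edges):
    """Canonical forms of all 1-augmented graphs whose parent is the given tree."""
    tree_code = canon(n, tree_edges)
    present = {tuple(sorted(e)) for e in tree_edges}
    found = set()
    for u, v in combinations(range(n), 2):
        if (u, v) in present:
            continue
        g = sorted(present) + [(u, v)]
        code = canon(n, g)
        # Skip duplicates from the same tree, and graphs owned by another parent.
        if code not in found and parent(n, g) == tree_code:
            found.add(code)
    return found

trees = [
    [(0, 1), (1, 2), (2, 3), (3, 4)],   # path
    [(0, 1), (0, 2), (0, 3), (0, 4)],   # star
    [(0, 1), (0, 2), (0, 3), (3, 4)],   # star with one subdivided edge
]
families = [children(5, t) for t in trees]
print(sum(len(fam) for fam in families))
```

Since the parent depends only on the unlabeled structure, every monocyclic graph is accepted from exactly one tree, so no global store of generated graphs is needed; the five monocyclic graphs on five vertices are each produced once.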

Summary of 2-Phase algorithm for 0-augmented trees

In this section, we summarize the 2-Phase algorithm [14]
for generating all multi-trees in $\mathcal{G}_0(g'_L, g_U)$. In the first
phase, we simplify input feature vectors by adding the
frequencies of the paths that include multiple edges to
the corresponding paths which consist of only simple
edges and then enumerate simple trees for the simplified
upper and lower feature vectors. Figure 5 illustrates feature
vectors $g_U$ and $g'_L$ and simplified feature vectors $\bar{g}_U$
and $\bar{g}_L$.
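The simplification step of the first phase, as described above, can be sketched as follows. The dictionary representation of a feature vector and the function name `simplify` are assumptions, and the numbers in the example are illustrative rather than taken from Figure 5.

```python
from collections import Counter

def simplify(g):
    """Fold every labeled path onto the path with all multiplicities set to 1.

    g: dict from alternating (color, multiplicity, color, ...) tuples to counts.
    The frequencies of paths that differ only in multiplicities are added.
    """
    simple = Counter()
    for t, freq in g.items():
        # Even positions hold colors, odd positions hold multiplicities.
        key = tuple(x if i % 2 == 0 else 1 for i, x in enumerate(t))
        simple[key] += freq
    return simple

g_upper = {
    ("O",): 1, ("C",): 2,
    ("O", 1, "C"): 1, ("C", 1, "O"): 1,
    ("O", 2, "C"): 1, ("C", 2, "O"): 1,
    ("C", 1, "C"): 2,
}
simplified = simplify(g_upper)
print(simplified[("O", 1, "C")], simplified[("C", 1, "O")])
```

In this example the entries for O1C and C1O grow from 1 to 2 because the O2C and C2O frequencies are added to them, while the atom counts are unchanged.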

In the second phase, we assign multiplicities of edges for
each of the simple trees to satisfy the feature vector constraint
and the valence constraint. The inputs and outputs
of the first phase and second phase in the 2-Phase algorithm
are described as follows:

First Phase

Input: A color set $\Sigma$, lower and upper feature vectors $g'_L$
and $g_U$, respectively;

Output: All simple trees in $\mathcal{G}_0(\bar{g}_L, \bar{g}_U)$ for $\bar{g}_L$ and $\bar{g}_U$ which
are simplified from $g'_L$ and $g_U$, respectively.

Second Phase

Input: A simple tree $T \in \mathcal{G}_0(\bar{g}_L, \bar{g}_U)$ obtained by the first
phase, a color set $\Sigma$, and lower and upper feature vectors
$g'_L$ and $g_U$, respectively;

Output: All multi-trees in $\mathcal{G}_0(g'_L, g_U)$ obtained by assigning
a multiplicity to $T$.

Figure 5 Illustration of modified feature vectors. The frequency of
O1C and C1O increases after the simplification.



From the given multi-tree $T \in \mathcal{G}_0(g'_L, g_U)$, our efficient
procedure generates all 1-augmented trees in $\mathcal{G}_1(g_L, g_U)$.

Parent-child relationship

In this section, to avoid duplication of a 1-augmented tree
during the entire enumeration over all $T \in \mathcal{G}_0(g'_L, g_U)$,
we introduce a parent-child relationship between a
0-augmented tree and a 1-augmented tree.

Signature of rooted multi-trees

To define the parent $\pi(G)$ of a 1-augmented tree $G$ using
only topological structure, we first introduce the concepts
of "canonical form" and "signature" for a class of multi-graphs.

We fix the total order of colors in $\Sigma$ arbitrarily, e.g.,
O < N < C, and regard each color $c \in \Sigma$ as a small integer
in $\mathbb{Z}_+$. We define the lexicographical order among
sequences with elements in $\Sigma \cup \mathbb{Z}_+$ as follows. A sequence
$A = (a_1, a_2, \ldots, a_p)$ is lexicographically smaller than a
sequence $B = (b_1, b_2, \ldots, b_q)$ (denoted by $A < B$) if and
only if there is an index $k$ such that (i) $a_i = b_i$ $(1 \leq i \leq k)$;
and (ii) $a_{k+1} < b_{k+1}$ $(k + 1 \leq \min\{p, q\})$ or $k = p < q$;
otherwise $A = B$, i.e., $p = q$ and $a_i = b_i$ $(1 \leq i \leq p)$, or
$B < A$. Let $A \leq B$ denote $A < B$ or $A = B$.
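The lexicographical order defined above coincides with the built-in ordering of Python tuples once colors are mapped to integers. The following sketch spells out the definition and checks it against tuple comparison; the rank table follows the example order O < N < C, and the names `lex_less` and `encode` are illustrative.

```python
# Example total order on colors from the text: O < N < C.
RANK = {"O": 0, "N": 1, "C": 2}

def encode(seq):
    """Replace color symbols by their ranks; integers are kept as they are."""
    return tuple(RANK.get(x, x) for x in seq)

def lex_less(a, b):
    """Return True if sequence a is lexicographically smaller than b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1                      # a and b agree on the first k entries
    if k < len(a) and k < len(b):
        return a[k] < b[k]          # first differing position decides
    return k == len(a) and len(a) < len(b)   # a is a proper prefix of b

A = encode(("C", 0, "C", 1, "N", 2, "N", 1, "O", 2, "C", 1, "N", 2))
B = encode(("C", 0, "C", 1, "N", 2, "C", 1, "N", 2, "N", 1, "O", 2))
print(lex_less(A, B), A < B)
```

The two sequences first differ at the seventh entry, where N is smaller than C, so A is smaller than B under both the explicit definition and tuple comparison.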

A multi-graph is called labeled if each vertex has a
unique name or an index such as $v_0, v_1, \ldots, v_{n-1}$, and
we usually record a multi-graph as labeled in our computer.
Hence, testing isomorphism of two multi-graphs is
to find labels for these "unlabeled graphs" such that the
two labeled graphs completely match each other including
the adjacency between every two vertices. For a class
$\mathcal{G}$ of multi-graphs, if we have a way of choosing a label
for each multi-graph $G \in \mathcal{G}$ that is unique up to automorphisms
of $G$, then we can test the isomorphism of
two graphs directly with their labels. Such a labeling for $G$
is called the canonical form of $G$. Once such a canonical
form is obtained, we can easily encode each multi-graph
$G \in \mathcal{G}$ into a code $\sigma(G)$ (which is an integer or a sequence
of integers/colors), called the signature of $G$, such that
two multi-graphs $G, G' \in \mathcal{G}$ are isomorphic if and only if
$\sigma(G) = \sigma(G')$. Without loss of generality we assume a
total order over $\{\sigma(G) \mid G \in \mathcal{G}\}$ by introducing, if necessary,
a total order over all colors and a lexicographical
(total) order over all sequences of integers and colors.

A rooted graph is a multi-graph in which a vertex is des- 
ignated as the root, and two rooted graphs are isomorphic 
if there is an isomorphism that maps their roots onto each 
other. 

Any tree $T$ has either a vertex or a pair of adjacent
vertices the removal of which leaves no component with at
least $|V(T)|/2$ vertices [35], where the former is called the
centroid and the latter is called the bicentroid.
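A centroid or bicentroid can be located with one traversal that computes subtree sizes. The sketch below uses the common equivalent criterion that a vertex belongs to the centroid or bicentroid when removing it alone leaves no component with more than half of the vertices; the adjacency-dictionary input and the function name `centroids` are assumptions.

```python
def centroids(adj):
    """Return the centroid [v] or the bicentroid [u, v] of a tree.

    adj: dict vertex -> iterable of neighbors (multiplicities are irrelevant).
    """
    n = len(adj)
    root = next(iter(adj))
    parent = {root: None}
    order = [root]
    for v in order:                 # breadth-first order; list grows while iterating
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)
    size = {v: 1 for v in adj}
    for v in reversed(order):       # accumulate subtree sizes bottom-up
        if parent[v] is not None:
            size[parent[v]] += size[v]
    result = []
    for v in adj:
        heaviest = n - size[v]      # component containing the parent of v
        for w in adj[v]:
            if parent.get(w) == v:  # w is a child of v
                heaviest = max(heaviest, size[w])
        if 2 * heaviest <= n:
            result.append(v)
    return result

path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
spider = {0: [1, 3, 5], 1: [0, 2], 2: [1], 3: [0, 4], 4: [3], 5: [0, 6], 6: [5]}
print(centroids(path4), centroids(spider))
```

The four-vertex path has a bicentroid formed by its two middle vertices, whereas the seven-vertex tree with three branches of two vertices each has a single centroid.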

In a rooted multi-tree $T$, the parent vertex of a non-root
vertex $v$ is denoted by $p(v)$ and the depth of a vertex $v$ is
denoted by $\mathrm{depth}(v)$, where the depth of a vertex is its distance
to the root. For a vertex $v$ in $T$, let $T_v$ denote the
subtree induced from $T$ by all descendants of $v$ including
$v$. For an edge $e = uv$ in $T$ (where $u = p(v)$), let $T_e$ $(T_{uv})$
denote the subtree of $T$ that consists of $T_v$ and $u = p(v)$
joined by edge $e = uv$.

For the class of rooted multi-trees, a canonical form of
a rooted multi-tree $T$ is given by an "ordered tree" $\tau$ of
it (i.e., determination of a total order among children of
each vertex). Let $\mathrm{dfs}(\tau)$ denote the total order of vertices
in $\tau$ visited by the depth-first-search order according to
the order for children in $\tau$. For example, Figure 6 illustrates
three ordered trees $\tau_1$, $\tau_2$ and $\tau_3$, which are obtained
from the same multi-tree $T$ rooted at the centroid, where
the number beside each vertex $v$ indicates $\mathrm{dfs}(v)$.

We let $\delta(\tau)$ denote the alternating sequence
$(c_0, d_0, c_1, d_1, \ldots, c_{n-1}, d_{n-1})$ such that $c_i$ and $d_i$ denote
the color and depth, respectively, of the $i$-th vertex $v_i$ in
$\mathrm{dfs}(\tau)$, and $M(\tau)$ denote the sequence $(m_1, m_2, \ldots, m_{n-1})$
of the multiplicity $m_i = m(v_i, p(v_i))$ of the edge joining
the $i$-th vertex and its parent $p(v_i)$ in $T$. For a vertex $v$,
let $\mathrm{dfs}(v)$ denote the labeling number of $v$ in $\mathrm{dfs}(\tau)$. For
example, Figure 6 illustrates $\delta(\tau_i)$ of ordered trees $\tau_i$,
$i = 1, 2, 3$ and $M(\tau_2)$ and $M(\tau_3)$.

Figure 6 Illustration of a rooted tree and left-heavy trees. (a) An
ordered tree $\tau_1$ rooted at its centroid; (b) a left-heavy tree $\tau_2$; and (c)
the canonical form $\tau_3$. The panels show vertices at depths 0, 1 and 2, with
$\delta(\tau_1) = (C, 0, C, 1, N, 2, N, 1, O, 2, C, 1, N, 2)$,
$\delta(\tau_2) = \delta(\tau_3) = (C, 0, C, 1, N, 2, C, 1, N, 2, N, 1, O, 2)$,
$M(\tau_2) = (1, 2, 2, 1, 1, 1)$ and $M(\tau_3) = (2, 1, 1, 2, 1, 1)$.

A left-heavy tree of a rooted multi-tree T is an ordered
tree $\tau$ that has the maximum code $\delta(\tau)$ among all
ordered trees of T (hence a left-heavy tree $\tau$ is a canonical
form and $\delta(\tau)$ is a signature of it when we ignore
the multiplicity of rooted multi-trees). We define the
canonical form of a rooted multi-tree T to be the left-heavy
tree $\tau$ that has the maximum code $M(\tau)$ among
all left-heavy trees of T, and let $\sigma(T)$ denote a signature
of T (a code of the canonical form $\tau$ such as
$(\delta(\tau), M(\tau))$). For example, in Figure 6, $\tau_2$ and $\tau_3$ are
left-heavy trees of T, since they have the lexicographically
maximum sequence $\delta(\tau_2) = \delta(\tau_3)$ among all ordered trees $\tau$ of
the rooted multi-tree T, and $\tau_3$ is the canonical form of T
and $(\delta(\tau_3) = (C, 0, C, 1, N, 2, C, 1, N, 2, N, 1, O, 2), M(\tau_3) =
(2, 1, 1, 2, 1, 1))$ is the signature of T since it is a left-heavy
tree with the lexicographically maximum $M(\tau_3)$ among all
left-heavy trees $\tau$ of T.
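
The definition above can be checked on small inputs by brute force. The following Python sketch is only an illustration under stated assumptions, not the enumeration method of the paper: it tries every child ordering (exponential time), it assumes the color order C > N > O suggested by the codes in Figure 6, and the tuple encoding `(color, multiplicity, children)` and the function names are hypothetical.

```python
from itertools import permutations, product

# Assumed color order C > N > O, chosen to agree with the codes in Figure 6.
RANK = {"C": 3, "N": 2, "O": 1}

def orderings(node):
    """Yield every ordered tree of node = (color, multiplicity, children)."""
    color, mult, kids = node
    for perm in permutations(kids):
        for choice in product(*(list(orderings(k)) for k in perm)):
            yield (color, mult, list(choice))

def codes(tree):
    """Return (delta, M) of an ordered tree, read in depth-first order."""
    delta, mults = [], []

    def visit(node, depth):
        color, mult, kids = node
        delta.extend((RANK[color], depth))
        if depth > 0:              # the root has no parent edge
            mults.append(mult)
        for kid in kids:
            visit(kid, depth + 1)

    visit(tree, 0)
    return tuple(delta), tuple(mults)

def canonical_form(tree):
    """Maximise delta first and M second, as in the definition above."""
    return max(orderings(tree), key=codes)

# Multi-tree consistent with the codes of Figure 6 (colors stored as ranks).
T = ("C", 0, [
    ("C", 1, [("N", 2, [])]),
    ("N", 1, [("O", 1, [])]),
    ("C", 2, [("N", 1, [])]),
])
delta, M = codes(canonical_form(T))
```

With the ranks mapped back to colors, `delta` corresponds to (C,0,C,1,N,2,C,1,N,2,N,1,O,2) and `M` to (2,1,1,2,1,1), the codes listed for the canonical form in Figure 6.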

Using the canonical form for rooted multi-trees, we 
can define a canonical form for "unrooted" multi-trees 
T by regarding them as trees rooted at the centroid or 
bicentroid. 

Defining parents $\pi$

We are now ready to define parents $\pi$ for 1-augmented
trees (note that there is no root for any 1-augmented tree).
Let G be a 1-augmented tree with a unique cycle C of
length p which by our assumption contains at least one
simple edge. Then there are p possible choices $G - e_i$,
i = 1, 2, ..., p, for the parent of G. We introduce a rule to
choose one of them based only on the topological information
on G and C. For each vertex v in C, let N(v) denote
the set of vertices in V - V(C) adjacent to v. Removing an
edge vw with $v \in V(C)$ and $w \in N(v)$ leaves a multi-tree
containing w, which we denote by $T_w$. For each vertex v in
C, we encode all multi-trees $T_w$, $w \in N(v)$, into a signature
$\sigma^*(v)$ using the signature $\sigma$ for rooted multi-trees; we set

$$\sigma^*(v) = (c(v), \sigma(T_{w_1}), \sigma(T_{w_2}), \ldots, \sigma(T_{w_h})),$$

such that $\sigma(T_{w_1}) \ge \sigma(T_{w_2}) \ge \cdots \ge \sigma(T_{w_h})$ holds
for $N(v) = \{w_1, w_2, \ldots, w_h\}$. Note that two vertices
v and v' in C have the same color and an identical
set of subtrees in N(v) and N(v') if and only if
$\sigma^*(v) = \sigma^*(v')$. For each simple edge e = uv in C, we
define a code $c^*(e)$ as follows. We encode the unique path
$u_1 (= u), u_2, \ldots, u_h (= v)$ from u to v along C that avoids e into
$\sigma^*(u, v) = (\sigma^*(u_1), m_1, \sigma^*(u_2), m_2, \ldots, m_{h-1}, \sigma^*(u_h))$,
where $m_i = m(u_i, u_{i+1})$. Symmetrically, we define
$\sigma^*(v, u) = (\sigma^*(u_h), m_{h-1}, \sigma^*(u_{h-1}), m_{h-2}, \ldots, m_1, \sigma^*(u_1))$.
The code $c^*(e)$ is defined to be the lexicographically
larger of the two sequences $\sigma^*(u, v)$ and $\sigma^*(v, u)$.
Furthermore, let $E^*(C)$ be the set of simple edges $e^*$ in
C such that $c^*(e^*)$ is lexicographically maximum among
$c^*(e)$ for all simple edges e in C.

Suzuki et al. Journal of Cheminformatics 2014, 6:31
http://www.jcheminf.com/content/6/1/31

We call an edge vw with $v \in V(C)$ and $w \in N(v)$ a heavy
edge if $T_w$ has at least |V(G)|/2 vertices. We distinguish
two cases to define the parent $\pi$.

Case 1. There is no heavy edge around C: For an
arbitrary edge $e \in E^*(C)$, we define $\pi(G)$ to be
G - e (note that when $|E^*(C)| \ge 2$, G is
symmetric around C and G - e and G - e' will
be isomorphic for any two edges $e, e' \in E^*(C)$).
Figure 7 illustrates how the parent $\pi(G)$ of a
1-augmented tree G in Case 1 is determined from
the signatures $\sigma^*(u, v)$ and $\sigma^*(v, u)$ of simple
edges uv in the cycle C.

Case 2. There is a heavy edge $v^*w^*$: Note that no other
edge can be a heavy edge. Let $e_1$ and $e_2$ be the
two edges in C that are adjacent to $v^*$, where at
least one of them is a simple edge since
$\deg(v^*) \le 4$. If exactly one of them, e.g., $e_1$, is a
simple edge, then we define $\pi(G)$ to be $G - e_1$.
When $e_1$ and $e_2$ are both simple edges, we choose
one of them as follows. We first ignore all trees
$T_w$ with $w \in N(v^*)$, which are symmetric at
the vertex $v^*$ commonly shared by $e_1$ and $e_2$
and hence useless for constructing a signature that
distinguishes $e_1$ and $e_2$. Without using $T_w$
with $w \in N(v^*)$, we construct the codes $c^*(e_1)$
and $c^*(e_2)$. Finally we choose any edge $e_i$ such
that $c^*(e_i)$ is lexicographically maximum
between $c^*(e_1)$ and $c^*(e_2)$, and define $\pi(G)$ to
be $G - e_i$.

Figure 7 Illustration of defining the parent $\pi(G)$ of a
1-augmented tree G. Three multi-trees $T_1 = G - uw$, $T_2 = G - vz$
and $T_3 = G - wz$ are obtained from G by removing a simple edge in
the cycle. Each number on the left side of each vertex v in G indicates
its signature $\sigma^*(v)$ of $\{T_w \mid w \in N(v)\}$. The code $\sigma^*$ for each pair of
adjacent vertices in the cycle of G is given by
$\sigma^*(w, u) = ((C, 2, 1), 1, (C, 2), 1, (C, 1), 2, (C))$,
$\sigma^*(u, w) = ((C), 2, (C, 1), 1, (C, 2), 1, (C, 2, 1))$,
$\sigma^*(v, z) = ((C, 1), 2, (C), 1, (C, 2, 1), 1, (C, 2))$,
$\sigma^*(z, v) = ((C, 2), 1, (C, 2, 1), 1, (C), 2, (C, 1))$,
$\sigma^*(w, z) = ((C, 2, 1), 1, (C), 2, (C, 1), 1, (C, 2))$, and
$\sigma^*(z, w) = ((C, 2), 1, (C, 1), 2, (C), 1, (C, 2, 1))$. Then $\pi(G)$ is
defined to be $T_1 = G - uw$ because $\sigma^*(w, u)$ is maximum over all of
these six codes.
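
The Case 1 rule can be illustrated with the data of Figure 7. The sketch below is a hedged illustration only: it covers Case 1 (no heavy edge), it assumes that the vertex signatures are already available as tuples and that Python's tuple comparison stands in for the lexicographic order of the paper, and the names `path_code` and `choose_removed_edge` are hypothetical.

```python
def path_code(cycle, sig, mult, start, step):
    """Signatures and multiplicities met when walking once around the cycle."""
    p = len(cycle)
    code, i = [], start
    for k in range(p):
        code.append(sig[cycle[i]])
        if k < p - 1:
            j = (i + step) % p
            code.append(mult[frozenset((cycle[i], cycle[j]))])
            i = j
    return tuple(code)

def choose_removed_edge(cycle, sig, mult):
    """Return (c*(e), e) for a simple cycle edge e with maximum code c*(e)."""
    p = len(cycle)
    best = None
    for i in range(p):
        u, v = cycle[i], cycle[(i + 1) % p]
        if mult[frozenset((u, v))] != 1:
            continue                      # only simple edges may be removed
        code_uv = path_code(cycle, sig, mult, i, -1)            # u ... v avoiding uv
        code_vu = path_code(cycle, sig, mult, (i + 1) % p, +1)  # v ... u avoiding uv
        cand = (max(code_uv, code_vu), (u, v))
        if best is None or cand[0] > best[0]:
            best = cand
    return best

# Cycle u - w - z - v - u of Figure 7; the edge vu is a double edge.
cycle = ["u", "w", "z", "v"]
sig = {"u": ("C",), "w": ("C", 2, 1), "z": ("C", 2), "v": ("C", 1)}
mult = {frozenset("uw"): 1, frozenset("wz"): 1,
        frozenset("zv"): 1, frozenset("vu"): 2}
code, edge = choose_removed_edge(cycle, sig, mult)
```

Here `edge` is the pair (u, w) and `code` equals the sequence given for $\sigma^*(w, u)$ in the caption of Figure 7, so the sketch selects G - uw as the parent, in agreement with the figure.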

Generating children 

Recall that our algorithm for enumerating 1-augmented
trees consists of two major stages: the first stage enumerates
all multi-trees $T \in \mathcal{G}_0(g_L, g_U)$ by 2-Phase algorithm,
and the second stage generates all feasible children G
for each $T \in \mathcal{G}_0(g_L, g_U)$, i.e., 1-augmented trees
$G \in \mathcal{G}_1(g_L, g_U)$ with $\pi(G) = T$. This section describes a
procedure for generating all children G = T + e of a given
multi-tree T by adding a new edge e.

For simplicity, we consider the case where a given multi-tree
T has the centroid (the case where it is rooted at the
bicentroid can be treated with a minor technical modification).
In the following, we assume that a given multi-tree
T is represented as its canonical form (a left-heavy tree)
$\tau$ rooted at its centroid, and that its sequences $\delta(\tau) =
(c_1, d_1, \ldots, c_n, d_n)$ and $M(\tau) = (m_2, m_3, \ldots, m_n)$ over the
labeling dfs($\tau$) have already been computed after the first
stage (2-Phase algorithm can deliver not only solutions T
but also $\tau$ and these sequences together).

It should be noted that the canonical form of left-heavy
trees enjoys the following recursive structure. For any vertex
v in T, the subtree $T_v$ of T rooted at v induces an
ordered tree $\tau_v$ from the left-heavy tree $\tau$, and $\tau_v$ is again
the canonical form of $T_v$, since dfs($\tau_v$) is a subsequence
of dfs($\tau$) with consecutive vertices and its ordered pair
$(\delta(\tau_v), M(\tau_v))$ is also lexicographically maximized over all
ordered trees of $T_v$.

Testing generated 1-augmented trees

Given the left-heavy tree $\tau$ of a multi-tree T, we add a
new edge xy for two nonadjacent vertices $x, y \in V(T)$
(dfs(x) < dfs(y)) to obtain G = T + xy. Let C denote the
cycle created in G. We check the following conditions to
test whether T is the parent of G or not.

Case I. C contains the root (centroid) of T:

(A) $\sigma^*(xy)$ is lexicographically maximum
among $\sigma^*(e)$ for all simple edges e in C.

Case II. Otherwise:

(B) x is the ancestor of y;

(C) $\sigma^*(xy) \ge \sigma^*(e)$ if the edge e incident to x
in C is simple.

Then, we have the following lemma, whose proof is
given in Additional file 1: S1.1.

Lemma 1. For a multi-tree T and two nonadjacent vertices
$x, y \in V(T)$, testing whether $T = \pi(T + xy)$ can
be done by checking the above conditions in $O(|V(C)|^2)$
time.
Avoiding duplication of children

In previous sections, we have shown that all children (i.e.,
1-augmented trees) of a given multi-tree T can be generated
by the definition of the parent-child relationship
between multi-trees and 1-augmented trees. However, if T
has two isomorphic subtrees $T_u$ and $T_v$, then we would
have two children T + xy and T + x'y' that are isomorphic
to each other. To avoid such duplication, we test if
T + xy is isomorphic to T + x'y' for some other vertices
x' and y' when we add an edge xy to T. In fact, we do
not try to find such another pair x' and y' explicitly. Instead
we introduce a rule that we do not generate T + xy by
any edge xy that has such an "isomorphic" vertex pair x'
and y' on the left-hand side of x and y in T. To detect
this situation efficiently, we first compute data on each
vertex v in T that indicates whether the left-hand side of
v contains another vertex u such that its subtree $T_{p(u)u}$
is isomorphic to $T_{p(v)v}$. Using such data, we show that,
given a vertex pair x and y, whether there is an isomorphic
pair x' and y' on the left-hand side can be checked
in constant time. In other words, we show an $O(n^2)$ time
algorithm extracting all the leftmost vertex pairs
from T.

Among all isomorphic vertex pairs, we call the leftmost 
one an "admissible pair" where "isomorphic" means here 
that the connection of vertices in each pair results in an 
isomorphic tree. Figure 8 shows an example of a case 
in which additions of different edges result in identical 
structures and the admissible pair. 

In this section, we show that we only need to
add an edge between each admissible pair to avoid duplication
and omission of 1-augmented trees generated from
one multi-tree T. Finally, for a multi-tree, we provide an
efficient algorithm extracting all vertex pairs to generate
all children of the multi-tree.

Figure 8 Illustration of admissible and non-admissible pairs.
(a) T + uv is constructed; (b) T + uv' is not constructed. A
cycle will be created by adding each edge shown by a black dotted
line. T + uv and T + uv' are 1-augmented trees
obtained from the multi-tree T by adding a simple edge such that
T + uv and T + uv' are isomorphic to each other. (a) The vertex pair
(u, v) is an admissible pair, and (b) the other is not an admissible pair
(at least one admissible pair always exists in every 1-augmented tree).
Therefore, T + uv' is not created (all 1-augmented trees are discarded
except for the 1-augmented tree created for an admissible pair).

Admissible pairs 

We write $T + uv \sim T + u'v'$ if and only if T + uv and
T + u'v' are isomorphic. For a tree T, let $c_T$ denote its
centroid, which is either a vertex (unicentroid) or an edge
(bicentroid). Let T be a left-heavy tree rooted at its centroid
$c_T$. When $c_T$ is a bicentroid rr', r and r' will be the
vertices that have no parent in the parent-child relationship
in T. We shall now introduce "rooted-isomorphism"
among 1-augmented trees obtained from T by adding a
new edge. We regard a 1-augmented tree G = T + uv
obtained by adding a new edge uv between two nonadjacent
vertices $u, v \in V(T)$ as a graph rooted at $c_T$. We say that
two 1-augmented trees G = T + uv
and G' = T + u'v' are rooted-isomorphic if they admit
an isomorphism $\psi$ such that $c_T$ in G = T + uv corresponds
to $c_T$ in G' = T + u'v' (i.e., $\psi(r) = r$ when $c_T$ is
a vertex r, and $\{\psi(r), \psi(r')\} = \{r, r'\}$ when $c_T$ is an edge
rr'). We write $T + uv \approx_r T + u'v'$ if and only if T + uv
and T + u'v' are rooted-isomorphic with root r. Then,
the following theorem holds, where the proof is given in
Additional file 1: S1.2.

Theorem 2. Let T be a left-heavy tree rooted at its
centroid $c_T$ and $\{u, v\}, \{u', v'\} \subseteq V(T)$ be two pairs of
nonadjacent vertices. If $T + uv \sim T + u'v'$ then
$T + uv \approx_r T + u'v'$.

Theorem 2 tells us that two 1-augmented trees G = T +
uv and G' = T + u'v' are isomorphic if and only if they
are rooted-isomorphic (i.e., $c_T$ in G corresponds to $c_T$ in
G' in the isomorphism $\psi$, where possibly $\psi(r) = r'$ and
$\psi(r') = r$ when $c_T = rr'$).

Now we consider how to generate a set $\mathcal{G}_T$ of
1-augmented trees T + uv such that the 1-augmented
tree T + uv for any pair of nonadjacent vertices $u, v \in
V(T)$ is isomorphic to exactly one 1-augmented tree G
in the set $\mathcal{G}_T$. By Theorem 2, we only need to check
the rooted-isomorphism among 1-augmented trees T +
uv for all pairs of nonadjacent vertices $u, v \in V(T)$.
Based on this, we can modify a given tree T with bicentroid
$c_T = rr'$ into a tree T' with unicentroid $r^*$ by
inserting a new vertex $r^*$ on the edge rr'. Since this does
not change the rooted-isomorphism among 1-augmented
trees T' + uv or the left-heaviness of T, we assume
in the following that a given tree T has a unicentroid
$c_T = r$.

Let T be a left-heavy tree. We shall introduce some
terminology. Let x be a non-root vertex in T. Denote by
left(x) the immediate left sibling of x (if
any). We define data copy as follows.

$$\mathrm{copy}(x) = \begin{cases} 1 & \text{if left}(x) \text{ exists and } T_y \text{ is isomorphic to } T_x \text{ (i.e., } x \text{ is a copy of } y\text{) for } y = \mathrm{left}(x), \\ 0 & \text{otherwise.} \end{cases}$$



Let u and v be two vertices in T. We denote by P(u, v) the
unique path in T that connects u and v, where P(u, v) =
P(v, u). Let lca(u, v) denote the least common ancestor of
u and v, i.e., the highest vertex in P(u, v) (where we define
lca(u, v) to be the edge $c_T = rr'$ when T is rooted at the
bicentroid $c_T = rr'$, and $u \in V(T_r)$ and $v \in V(T_{r'})$).
When dfs(u) < dfs(v), we define the greatest uncommon
ancestor gua of u and v as follows:

Let gua(u, v) denote the child of lca(u, v) that is
closest to u in T, where gua(u, v) is an ancestor of u
(including u itself) if $\mathrm{lca}(u, v) \ne u$;

Let gua(v, u) denote the child of lca(u, v) that is an
ancestor of v (including v itself), where
gua(u, v) = gua(v, u) if lca(u, v) = u.

We call a pair of nonadjacent vertices $u, v \in V(T)$ with
dfs(u) < dfs(v) admissible if it satisfies the following
conditions (see Figure 9 for conditions (1) and (2) and
Figure 10 for condition (3)):

(1) copy(w) = 0 for all vertices $w \in V(P(\mathrm{lca}(u, v), r)) - \{r\}$;

(2) copy(w) = 0 for all vertices $w \in V(P(u, \mathrm{gua}(u, v))) \cup
V(P(v, \mathrm{gua}(v, u))) - \{\mathrm{lca}(u, v), \mathrm{gua}(v, u)\}$;

(3) if copy(gua(v, u)) = 1 then

(i) gua(u, v) = left(gua(v, u)) (hence $u \ne
\mathrm{lca}(u, v)$); and

(ii) for the copy $\hat{u}$ of vertex u in $T_{\mathrm{gua}(v,u)}$, it holds
$\mathrm{dfs}(v) \ge \mathrm{dfs}(\hat{u})$ (where $\mathrm{dfs}(\hat{u}) = \mathrm{dfs}(u) +
|V(T_{\mathrm{gua}(u,v)})|$).

Note that (3)-(i) implies that copy(gua(v, u)) in (2) needs
to be 0 when lca(u, v) = u.
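
The three conditions can be written down as a direct O(n)-per-pair check. The sketch below is an illustration under several assumptions, not the constant-time test of the paper: the tree is given as ordered child lists that are already in left-heavy order, edge multiplicities are ignored, isomorphic sibling subtrees are detected by comparing ordered codes, and the comparison in (3)-(ii) is read as non-strict. All function names are hypothetical.

```python
def prepare(children, color, root):
    """Return parent, dfs number, subtree size and copy flag of each vertex."""
    parent, dfs, size, code, copy = {root: None}, {}, {}, {}, {}
    order, stack = [], [root]
    while stack:                          # preorder, children left to right
        x = stack.pop()
        dfs[x] = len(order)
        order.append(x)
        for c in reversed(children.get(x, [])):
            parent[c] = x
            stack.append(c)
    for x in reversed(order):             # descendants before ancestors
        kids = children.get(x, [])
        size[x] = 1 + sum(size[c] for c in kids)
        code[x] = (color[x], tuple(code[c] for c in kids))
    for x in order:
        sibs = children.get(parent[x], []) if parent[x] is not None else []
        i = sibs.index(x) if x in sibs else 0
        copy[x] = int(i > 0 and code[sibs[i - 1]] == code[x])
    return parent, dfs, size, copy

def ancestors(x, parent):
    """x, p(x), p(p(x)), ... up to the root."""
    out = []
    while x is not None:
        out.append(x)
        x = parent[x]
    return out

def is_admissible(u, v, children, parent, dfs, size, copy):
    """Conditions (1)-(3) for a nonadjacent pair with dfs(u) < dfs(v)."""
    au, av = ancestors(u, parent), ancestors(v, parent)
    lca = next(a for a in au if a in av)
    gvu = av[av.index(lca) - 1]                       # gua(v, u)
    guv = au[au.index(lca) - 1] if lca != u else gvu  # gua(u, v)
    if any(copy[w] for w in ancestors(lca, parent)[:-1]):   # condition (1)
        return False
    path_u = au[:au.index(lca)]           # u ... gua(u, v); empty if lca == u
    path_v = av[:av.index(gvu)]           # v ... strictly below gua(v, u)
    if any(copy[w] for w in path_u + path_v):               # condition (2)
        return False
    if copy[gvu]:                                           # condition (3)
        sibs = children[lca]
        if lca == u or sibs[sibs.index(gvu) - 1] != guv:
            return False
        if dfs[v] < dfs[u] + size[guv]:
            return False
    return True

# Star with three identical leaves: ab, ac and bc all give isomorphic graphs.
children = {"r": ["a", "b", "c"]}
color = {x: "C" for x in "rabc"}
parent, dfs, size, copy = prepare(children, color, "r")
pairs = [(u, v) for u in "rabc" for v in "rabc"
         if dfs[u] < dfs[v] and parent[v] != u and parent[u] != v]
admissible = [p for p in pairs
              if is_admissible(*p, children, parent, dfs, size, copy)]
```

In this small example the three nonadjacent pairs produce pairwise isomorphic 1-augmented trees, and only the leftmost pair (a, b) passes the check.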

The next lemma indicates that we only need to add an 
edge between each admissible pair to avoid duplication of 
1-augmented trees, where the proof is given in Additional 
file 1: S1.3. 

Lemma 3. For a left-heavy tree T rooted at its unicentroid
$c_T = r$, let $\mathcal{G}_T = \{T + uv \mid \text{admissible pairs } u, v \in
V(T)\}$. Then the 1-augmented tree T + uv for any pair of
nonadjacent vertices $u, v \in V(T)$ is isomorphic to exactly
one 1-augmented tree G in $\mathcal{G}_T$.

Algorithm 

In this section, we describe an algorithm of the second
stage to generate all children of a given multi-tree T
without duplication and omission, and show the computational
complexity of the second stage. To generate all
children of T, we first find all admissible pairs (u, v) for
T and test whether T is the parent $\pi(T + uv)$ of T + uv
or not. Notice that a straightforward method would take
$O(n)$ time to check whether a pair (u, v) is admissible or
not. Since there are at most $\binom{n}{2}$ vertex pairs in a multi-tree
T, finding all admissible pairs for T may take $O(n^3)$ time.
That is, from Lemma 1, we may need $O(n^3 |V(C)|^2) =
O(n^5)$ time to generate all children of T.

In what follows, we design a faster $O(n^4)$-time algorithm
to generate all children of a given multi-tree T. For this,
we find only a subset of all admissible pairs, called the set
of "candidate" pairs defined as follows (see also Figure 11).
We see that no pair (x, y) generates a child T + xy of T if 
a heavy edge is created in T + xy and x is not an ances- 
tor of y, since such (x, y) does not satisfy any of Cases I 
and II for generating children of T. Hence we are not 
interested in storing such pairs (x, y), and call an admis- 
sible pair (x, y) a candidate pair when (i) no heavy edge 
is created in T + xy; or (ii) x is an ancestor of y, where 
(i) (resp., (ii)) is a necessary condition of Case I (resp., 
Case II). By definition, every candidate vertex pair (x, y) 
is admissible, whereas any admissible pair (x, y) such that 
T + xy is a child of T is always a candidate pair. There- 
fore, to generate all children of T, we do not need to find 
all admissible pairs and only have to extract all candidate 
pairs. 

To facilitate this, we examine all vertex pairs (u, v)
(dfs(u) < dfs(v)) in T in a lexicographical order with
respect to (dfs(u), dfs(v)), i.e., we choose each vertex
$v_i$ from $v_0$ to $v_{n-1}$ as u and then choose each vertex
$v_j$ from $v_{i+1}$ to $v_{n-1}$ as v. We call the lexicographical
order over vertex pairs a dfs order. For each of the
generated vertex pairs, we check whether it is a candidate
pair or not. Finally, for each candidate pair (u, v),
we test whether T + uv is a child of T in Case I
or II.

We can find all candidate pairs for a multi-tree T in
$O(n^2)$ time in total as stated below. The proof is given in
Additional file 1: S1.4.

Lemma 4. For a left-heavy multi-tree T, all candidate
pairs of T can be found in $O(n^2)$ time.

Finally, by Lemma 1 and Lemma 4, we can generate all
children of a multi-tree T in $O(n^4)$ time, as stated in the
next lemma.

Lemma 5. Given a left-heavy multi-tree T, all children
of T can be generated in $O(n^4)$ time.








Figure 9 Illustration of conditions (1) and (2) for admissible pairs.
Panels: (a) $\mathrm{lca}(u, v) \ne u$; (b) lca(u, v) = u. The black dotted line joins two vertices u and v, which will be the edge to
create a cycle in the 1-augmented tree T + uv. (a) and (b) illustrate the case of $\mathrm{lca}(u, v) \ne u$ (where two rooted subtrees $T_{\mathrm{gua}(u,v)}$ and
$T_{\mathrm{gua}(v,u)}$ are not isomorphic to each other by copy(gua(v, u)) = 0) and the case of lca(u, v) = u, respectively (note that $\mathrm{lca}(u, v) \ne v$ by
dfs(u) < dfs(v)). Conditions (1) and (2) exclude any ancestor w of u or v such that copy(w) = 1 (otherwise we can prove that there is a
lexicographically smaller pair (u', v') such that T + u'v' is isomorphic to T + uv).



Proof. For a left-heavy multi-tree T, we can find all
candidate pairs in $O(n^2)$ time by Lemma 4. For each candidate
pair (u, v), we can test whether $T = \pi(T + uv)$
or not in $O(|V(C)|^2)$ time by Lemma 1. Thus, for a
left-heavy multi-tree T, all children of T can be generated in
$O(n^2 |V(C)|^2) = O(n^4)$ time. □

Experimental results

This section reports experimental results of our algorithm
enumerating 1-augmented trees. Tests were carried
out on a PC with an Intel Core i5 processor running at
3.20 GHz and the Linux operating system using the C
language, employing instances based on chemical compounds
selected from the KEGG LIGAND database [15]
(http://www.genome.jp/ligand/).

(I) First we select four chemical compounds "C00062,"
"C03343," "C03690," and "C07178" as chemical graphs
with 0-augmented tree (acyclic) structure and four
chemical compounds "C00095," "C00270," "C00645," and
"C00837" as chemical graphs with 1-augmented tree
structure (see Additional file 1: Figure S21 for illustrations
of these chemical graphs), wherein each benzene ring in
chemical compounds "C03343," "C03690," and "C07178" is
regarded as a virtual atom b of valence 6. These compounds
are heuristically selected based on the following
criteria: (i) each compound is a 0-augmented tree or
1-augmented tree (except benzene rings), (ii) each compound
consists of C, O, H (or C, O, N, H) atoms, (iii) compounds
are not very similar to each other, and (iv)
compounds have varying sizes but are not too large.



The virtual atom b is treated as one atom so that we
discard all possible regioisomers of benzene. Thus in our
experiment, we consider the cycles not caused by benzenes
but by other substructures in these 1-augmented
trees. We remark that an efficient algorithm has been
developed for generating all possible regioisomers of a
given 0-augmented tree structure with virtual atoms b by
Li et al. [36], and an implementation of the algorithm is
available on a web server
(http://sunflower.kuicr.kyoto-u.ac.jp/tools/enumol2/).

To generate problem instances from each of the selected
chemical graphs, we define $w \in \mathbb{Z}_+$ to be a width between
upper and lower feature vectors. From the feature vector
$g = f_K(G)$ of a chemical graph G at level K, we construct
two feature vectors $g_U$ and $g_L$ of width w as follows. For
each entry t with $g[t] \ge 1$, let $g_U[t] = g[t] + w$ and $g_L[t] =
\max\{g[t] - w, 0\}$; and for each entry t with g[t] = 0, let
$g_U[t] = g_L[t] = 0$. See Additional file 1: Figure S23 (resp.,
Additional file 1: Figure S24) for the lower and upper feature
vectors $g_L$ and $g_U$ with K = 1 and w = 1 (resp., K = 2
and w = 1) created from C00062.
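
The construction of the two bounding vectors is simple enough to state as code. The following sketch assumes, purely for illustration, that a feature vector is stored as a dictionary from path labels to frequencies; the labels and counts in the example are made up and the function name `bounds` is hypothetical.

```python
def bounds(g, w):
    """Lower and upper feature vectors of width w around g."""
    # Entries with frequency 0 stay 0 in both vectors.
    g_lower = {t: max(c - w, 0) if c >= 1 else 0 for t, c in g.items()}
    g_upper = {t: c + w if c >= 1 else 0 for t, c in g.items()}
    return g_lower, g_upper

# Hypothetical level-1 style feature vector (labels and counts are made up).
g = {"C": 6, "CC": 5, "CN": 1, "NN": 0}
g_lower, g_upper = bounds(g, 2)
```

Note that an entry with frequency 0 is fixed to 0 in both vectors, so widening w never introduces a path that is absent from the original compound.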

To examine the computational efficiency, we compare
the time per output multi-tree/1-augmented tree
by our algorithm and by 2-Phase algorithm [14]. Our
algorithm enumerates not only the 1-augmented trees in
$\mathcal{G}_1(g_L, g_U)$ but also the multi-trees in $\mathcal{G}_0(g_L, g_U)$.
Therefore, if the time per output graph of our algorithm is close
to that of 2-Phase algorithm, then we can enumerate
1-augmented trees as fast as 2-Phase algorithm enumerates
multi-trees.
Figure 10 Illustration of condition (3) for admissible pairs. The
black dotted line joins two vertices u and v, which will be the edge to
create a cycle in the 1-augmented tree T + uv. Conditions (1) and (2)
exclude any ancestor w of u or v such that copy(w) = 1 except for
w = gua(v, u). The assumption copy(gua(v, u)) = 1 requires the
rooted subtree $T_{\mathrm{gua}(u,v)}$ to be isomorphic to $T_{\mathrm{gua}(v,u)}$, and condition
(3)-(i) requires $T_{\mathrm{gua}(u,v)}$ to be located immediately on the left of $T_{\mathrm{gua}(v,u)}$
among the subtrees rooted at the children of lca(u, v). Then $T_{\mathrm{gua}(v,u)}$
contains a copy $\hat{u}$ of u (i.e., $\mathrm{dfs}(\hat{u}) = \mathrm{dfs}(u) + |T_{\mathrm{gua}(u,v)}|$). Similarly
$T_{\mathrm{gua}(u,v)}$ contains a copy $\hat{v}$ of v (i.e., $\mathrm{dfs}(\hat{v}) = \mathrm{dfs}(v) - |T_{\mathrm{gua}(u,v)}|$).
Condition (3)-(ii) requires $\mathrm{dfs}(\hat{u}) \le \mathrm{dfs}(v)$ (otherwise we can prove
that $(\hat{v}, \hat{u})$ is a lexicographically smaller pair such that $T + \hat{v}\hat{u}$ is
isomorphic to T + uv). Although $T_{\mathrm{gua}(u,v)}$ and $T_{\mathrm{gua}(v,u)}$ are isomorphic
to each other, only paths and nodes relevant for explanation are
shown in this figure.



Table 1 shows the result of the comparison of 2-Phase
algorithm and our algorithm for varying K with fixed
w = 1, where the meanings of columns are as follows.

Note on tables:

(1) C00062, C00095, C00270, C00645, C00837, C03343,
C07178, and C03690 are chemical compounds in
the KEGG LIGAND database;

(2) in Table 1, the width for constructing upper and
lower feature vectors is 1;

(3) n is the number of vertices without hydrogen atoms
in an instance preprocessed by replacing each
benzene ring with a new atom of valence six;

(4) w is the width for constructing upper and lower
feature vectors;

(5) K is the level of given feature vectors;

(6) "time (s)" is the CPU time in seconds;

(7) T.O. means "time over" (the time limit is set to
1,800 seconds);

(8) "time/graph" is the time per enumerated graph;

(9) "tree" is the number of solutions that are
tree-like chemical graphs found within the time limit;

(10) "cycle" is the number of solutions that are
1-augmented tree chemical graphs found within the time limit;

(11) "ratio" is the "time/graph" of our
algorithm divided by that of 2-Phase algorithm;

(12) for any real numbers x and y, let xEy denote $x \times 10^y$.
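
As a worked check of notes (8) and (11), the derived columns can be recomputed from the raw counts and times. The sketch below uses the C00062, K = 1 row of Table 1. That "time/graph" of our algorithm divides by the sum of the "tree" and "cycle" counts is an inference that agrees with the tabulated values, not a statement taken from the text, and the function name is hypothetical.

```python
def time_per_graph(seconds, graphs):
    """CPU time divided by the number of enumerated graphs."""
    return seconds / graphs

# C00062, K = 1, w = 1 (values read from Table 1).
two_phase = time_per_graph(0.288, 388_192)            # trees only
ours = time_per_graph(14.056, 388_192 + 1_786_467)    # trees plus cycles
ratio = ours / two_phase
```

Rounded as in the table, these values correspond to 741.9E-9, 6.5E-6 and a ratio of 8.7.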

It is to be noted that in some instances, the number of 
enumerated trees by 2-Phase algorithm and that of our 
algorithm are different because of the time limit. Hence, 




Figure 11 Illustration of candidate pairs. A graph G will be created by adding each edge shown by a black dotted line. However, no vertex pair
for any of edges a, b and c is a candidate pair. We do not have to add edges a, b, and c to $T_1$, $T_2$, and $T_3$, respectively, because none of G - a, G - b,
and G - c can be the parent of G.
Table 1 Comparison of 2-Phase algorithm and our algorithm in chemical graphs (I)

                           2-Phase Algorithm                    Our Algorithm
Entry   Formula     n   K  Tree             Time     Time/graph Tree         Cycle         Time     Time/graph  Ratio
C00062  C6H14N2O4   12  1  388,192          0.288    741.9E-9   388,192      1,786,467     14.056   6.5E-6      8.7
                        2  614              0.021    34.2E-6    614          229           0.134    159.0E-6    4.6
                        3  95               0.020    210.5E-6   95           0             0.039    410.5E-6    2.0
                        4  1                0.008    8.0E-3     1            0             0.018    18.0E-3     2.3
                        5  1                0.005    5.0E-3     1            0             0.006    6.0E-3      1.2
                        6  1                0.006    6.0E-3     1            0             0.006    6.0E-3      1.0
                        7  1                0.004    4.0E-3     1            0             0.004    4.0E-3      1.0
C00095  C6H12O6     12  1  1,708            0.007    4.1E-6     1,708        12,626        0.050    3.5E-6      0.9
                        2  50               0.004    80.0E-6    50           1,085         0.043    37.9E-6     0.5
                        3  0                0.046    -          0            286           0.046    160.8E-6    -
                        4  0                0.006    -          0            19            0.035    1.8E-3      -
                        5  0                0.004    -          0            7             0.030    4.3E-3      -
                        6  0                0.004    -          0            5             0.028    5.6E-3      -
                        7  0                0.004    -          0            5             0.013    2.6E-3      -
C03343  C16H22O4    15  1  5,446,987        7.690    1.4E-6     5,446,987    31,395,098    217.681  5.9E-6      4.2
                        2  373              0.022    59.0E-6    373          71            0.171    385.1E-6    6.5
                        3  187              0.023    123.0E-6   187          25            0.071    334.9E-6    2.7
                        4  101              0.022    217.8E-6   101          9             0.042    381.8E-6    1.8
                        5  51               0.022    431.4E-6   51           6             0.059    1.0E-3      2.4
                        6  43               0.009    209.3E-6   43           0             0.036    837.2E-6    4.0
                        7  28               0.013    464.3E-6   28           0             0.020    714.3E-6    1.5
C00645  C8H15NO6    15  1  2,926,382        2.878    983.5E-9   2,926,382    23,965,432    146.669  5.5E-6      5.5
                        2  41,468           1.035    25.0E-6    41,468       213,820       37.792   148.0E-6    5.9
                        3  491              0.562    1.1E-3     491          4,482         6.281    1.3E-3      1.1
                        4  0                0.523    -          0            73            4.168    57.1E-3     -
                        5  0                0.374    -          0            5             2.320    464.0E-3    -
                        6  0                0.135    -          0            3             0.430    143.3E-3    -
                        7  0                0.121    -          0            1             0.328    328.0E-3    -
C00837  C8H18N6O4   18  1  167,172,180      238.554  1.4E-6     > 1,594,520  > 33,962,677  T.O.     50.7E-6     35.5
                        2  210              1.232    5.9E-3     210          641           1.888    2.2E-3      0.4
                        3  0                0.853    -          0            4             1.206    301.5E-3    -
                        4  0                0.445    -          0            2             0.523    261.5E-3    -
                        5  0                0.389    -          0            1             0.596    596.0E-3    -
                        6  0                [?]      -          0            1             [?]      [?]         -
                        7  0                0.285    -          0            1             0.395    395.0E-3    -
C07178  C21H28N2O5  18  1  62,234,720       321.155  5.2E-6     > 4,812,773  > 40,426,928  T.O.     39.8E-6     7.7
                        2  884              0.310    350.7E-6   884          180           1.824    1.7E-3      4.9
                        3  22               0.026    1.2E-3     22           4             0.099    3.8E-3      3.2
                        4  1                0.004    4.0E-3     1            0             0.005    5.0E-3      1.3
                        5  1                0.004    4.0E-3     1            0             0.005    5.0E-3      1.3
                        6  1                0.004    4.0E-3     1            0             0.005    5.0E-3      1.3
                        7  1                0.005    5.0E-3     1            0             0.005    5.0E-3      1.0

A "-" marks an empty cell; [?] marks a value that is illegible in the source.
Table 1 Comparison of 2-Phase algorithm and our algorithm in chemical graphs (I) (Continued)

                           2-Phase Algorithm                    Our Algorithm
Entry   Formula     n   K  Tree             Time     Time/graph Tree         Cycle         Time     Time/graph  Ratio
C00270  C11H19NO9   21  1  > 1,208,446,991  T.O.     1.5E-6     > 7,009,856  > 47,008      T.O.     255.1E-6    171.2
                        2  27,312,856       965.131  35.3E-6    > 337,989    > 3,593,865   T.O.     458.1E-6    13.0
                        3  156,073          391.611  2.5E-3     > 1,546      > 234,187     T.O.     7.6E-3      3.0
                        4  0                299.393  -          > 0          > 165         T.O.     10.9E+0     -
                        5  0                208.268  -          > 0          > 7           T.O.     257.1E+0    -
                        6  0                19.361   -          0            2             85.109   42.6E+0     -
                        7  0                9.720    -          0            2             30.301   15.2E+0     -
C03690  C24H38O4    23  1  > 664,049,939    T.O.     2.7E-6     > 4,621,297  > 33,216,732  T.O.     47.6E-6     17.5
                        2  164,885          34.357   208.4E-6   164,885      1,425         213.810  1.3E-3      6.2
                        3  32,995           15.978   484.3E-6   32,995       179           66.612   2.0E-3      4.1
                        4  3,884            2.265    583.2E-6   3,884        17            5.383    1.4E-3      2.4
                        5  1,237            1.490    1.2E-3     1,237        13            3.466    2.8E-3      2.3
                        6  559              0.773    1.4E-3     559          0             1.554    2.8E-3      2.0
                        7  177              0.445    2.5E-3     177          0             0.617    3.5E-3      1.4

the "tree" and "cycle" columns show the number of solutions
found so far (an incomplete count) in instances whose "time" column is "T.O.".
However, this is not a critical issue because we mainly
want to know the "time/graph" and its "ratio" between
2-Phase algorithm and our algorithm. The "tree," "cycle," and "time"
columns therefore remain useful even when they are
incomplete because of "T.O.".

We find that almost all instances solved within the 
time limit by 2-Phase algorithm are also solved by our 
algorithm within the time limit. Moreover, the "ratio" of 
instances is less than 10 except 4 out of 38 cases, and 
that of many instances is less than 5. This means that 
the time per output by our algorithm is close to that 
by 2-Phase algorithm. Therefore, we have demonstrated
that our algorithm maintains the high computational
efficiency of 2-Phase algorithm even if K changes. Note that
our algorithm does not output any 1-augmented trees in
$\mathcal{G}_1(g_L, g_U)$ for "C00062," "C03343," "C07178," and "C03690"
when K is large. This is because the instances are acyclic
chemical compounds: 1-augmented trees become less 
able to satisfy the feature vector constraint as K increases 
and only multi-trees can satisfy the feature vector 
constraint. 

Table 2 shows the result of the comparison of 2-Phase 
algorithm and our algorithm for varying w with fixed 
K = 3. Just like with Table 1, almost all instances 
solved within the time limit by 2-Phase algorithm are 
also solved by our algorithm within the time limit. The 
"ratio" of instances is less than 10 except 7 out of 48 
cases, and that of many instances is less than 5. In par- 
ticular, with respect to "C00095," "C00645," and "C00837," 
which have 1-augmented tree structure, the "ratio" is 
less than 1 or close to 1. This implies that our algo- 
rithm can enumerate 1-augmented trees and multi-trees 
faster than 2-Phase algorithm enumerates multi-trees. 



These results mean that the time per output by our algo- 
rithm is close to that by 2-Phase algorithm. Therefore, we 
have demonstrated that our algorithm maintains the high 
computational efficiency of 2-Phase algorithm even if w 
changes. 

Finally, from Table 1 and Table 2, we compare 2-Phase
algorithm and our algorithm in terms of varying
n, where n is the size of an instance. Note that n is
the number of vertices without hydrogen atoms in an 
instance preprocessed by replacing each benzene ring 
with a new atom with six valences. We notice that there 
is no large difference in the "ratio" between all cases in 
spite of the fact that the instance size of C03690 is almost 
twice as large as that of C00062. This implies that our 
algorithm maintains the high computational efficiency 
of 2-Phase algorithm even if the instance size becomes 
large. 

(II) Next we select four chemical compounds, prosta- 
glandin (D08040), allobarbital (D00332), gabapentin 
(D00555), and histamine (D00079) as chemical graphs 
with 1-augmented tree structure (see Additional file 1: 
Figure S22 for illustrations of these chemical graphs), all 
of which are existing drug compounds. We conducted the 
same experiment as we did for (I): Table 3 shows the result 
of the comparison of 2-Phase algorithm and our algorithm 
for varying K with fixed w = 1; Table 4 shows the result of 
the comparison of 2-Phase algorithm and our algorithm 
for varying w with fixed K = 3. In this experiment, we 
observe that there still is no large difference in the "ratio" 
between all cases except for the instance of D00079. 

Discussions and conclusions 

We considered the problem of enumerating all chemi- 
cal graphs of 1-augmented tree structure from a given 
set of path-frequency based feature vectors specified by 

Table 2 Comparison of varying width w in chemical graphs (I)

                               2-Phase Algorithm                  Our Algorithm
Entry   Formula     n   w   K  Tree          Time      Time/graph Tree        Cycle        Time     Time/graph Ratio
C00062  C6H14N2O4   12  1   3  95            0.020     210.5E-6   95          0            0.039    410.5E-6   2.0
                        2   3  862           0.020     23.2E-6    862         100          0.130    135.1E-6   5.8
                        3   3  2,531         0.030     11.8E-6    2,531       894          0.213    62.2E-6    5.3
                        4   3  3,611         0.044     12.2E-6    3,611       2,737        0.254    40.0E-6    3.3
                        5   3  4,438         0.044     9.9E-6     4,438       5,454        0.265    26.8E-6    2.7
                        50  3  5,138         0.045     8.8E-6     5,138       12,044       0.388    22.6E-6    2.6
C00095  C6H12O6     12  1   3  0             0.005     -          0           286          0.046    160.8E-6   -
                        2   3  40            0.065     1.6E-3     40          1,569        0.065    40.4E-6    0.0
                        3   3  280           0.009     32.1E-6    280         4,899        0.089    17.2E-6    0.5
                        4   3  855           0.010     11.7E-6    855         8,273        0.148    16.2E-6    1.4
                        5   3  1,502         0.011     7.3E-6     1,502       12,085       0.177    13.0E-6    1.8
                        50  3  4,608         0.012     2.6E-6     4,608       23,686       0.186    6.6E-6     2.5
C03343  C16H22O4    15  1   3  187           0.023     123.0E-6   187         25           0.071    334.9E-6   2.7
                        2   3  2,077         0.091     43.8E-6    2,077       1,251        0.880    264.4E-6   6.0
                        3   3  5,345         0.201     37.6E-6    5,345       5,746        3.134    282.6E-6   7.5
                        4   3  10,391        0.346     33.3E-6    10,391      16,912       4.041    148.0E-6   4.4
                        5   3  14,531        0.482     33.2E-6    14,531      33,064       5.887    123.7E-6   3.7
                        50  3  19,819        0.655     33.0E-6    19,819      94,725       7.833    68.4E-6    2.1
C00645  C8H15NO5    15  1   3  491           0.562     1.1E-3     491         4,482        6.281    1.3E-3     1.1
                        2   3  7,846         1.122     143.0E-6   7,846       76,261       17.199   204.5E-6   1.4
                        3   3  151,227       2.420     16.0E-6    151,227     716,216      39.476   45.5E-6    2.8
                        4   3  216,507       2.946     13.6E-6    216,507     1,270,462    33.842   22.8E-6    1.7
                        5   3  272,898       3.405     12.5E-6    272,898     1,757,010    40.323   19.9E-6    1.6
                        50  3  355,958       3.985     11.2E-6    355,958     2,625,154    43.002   14.4E-6    1.3
C00837  C8H18N6O4   18  1   3  0             0.853     -          0           4            1.206    301.5E-3   -
                        2   3  389           1.569     4.0E-3     389         660          2.496    2.4E-3     0.6
                        3   3  2,510         1.999     796.4E-6   2,510       3,173        3.367    592.5E-6   0.7
                        4   3  8,544         2.314     270.8E-6   8,544       12,834       3.994    186.8E-6   0.7
                        5   3  13,796        2.465     178.7E-6   13,796      27,186       4.841    118.1E-6   0.7
                        50  3  24,313        2.683     110.4E-6   24,313      94,089       5.540    46.8E-6    0.4
C07178              18  1   3  22            0.026     1.2E-3     22          4            0.099    3.8E-3     3.2
                        2   3  386           0.058     150.3E-6   386         261          0.426    658.4E-6   4.4
                        3   3  2,376         0.089     37.5E-6    2,376       1,288        0.735    200.6E-6   5.4
                        4   3  4,092         0.102     24.9E-6    4,092       2,240        0.863    136.3E-6   5.5
                        5   3  4,629         0.109     23.5E-6    4,629       2,385        1.284    183.1E-6   7.8
                        50  3  5,103         0.115     22.5E-6    5,103       2,603        0.980    127.2E-6   5.6
C00270  C11H19NO9   21  1   3  156,073       391.611   2.5E-3     >1,546      >234,187     T.O.     7.6E-3     3.0
                        2   3  12,515,364    1331.770  106.4E-6   >93,244     >1,107,707   T.O.     1.5E-3     14.1
                        3   3  >88,182,895   T.O.      20.4E-6    >4,350,635  >1,135,547   T.O.     328.3E-6   16.1
                        4   3  >134,281,382  T.O.      13.4E-6    >5,230,515  >2,086,287   T.O.     246.0E-6   18.4
                        5   3  >169,965,948  T.O.      10.6E-6    >5,213,010  >2,383,696   T.O.     237.1E-6   22.4
                        50  3  >254,637,067  T.O.      7.1E-6     >7,025,893  >12,785,700  T.O.     90.9E-6    12.9
C03690  C24H38O4    23  1   3  32,995        15.978    484.3E-6   32,995      179          66.612   2.0E-3     4.1
                        2   3  2,472,133     149.048   60.3E-6    >1,763,123  >702,493     T.O.     730.0E-6   12.1
                        3   3  13,120,833    509.010   38.8E-6    >2,416,279  >2,028,470   T.O.     405.0E-6   10.4
                        4   3  43,379,162    1289.269  29.7E-6    >1,815,035  >4,297,658   T.O.     294.5E-6   9.9
                        5   3  >80,447,027   T.O.      22.4E-6    >2,828,014  >10,431,681  T.O.     135.7E-6   6.1
                        50  3  >111,576,848  T.O.      16.1E-6    >3,580,958  >55,327,406  T.O.     30.6E-6    1.9

upper and lower feature vectors, and proposed a new
exact algorithm by extending the 2-Phase algorithm [14].
The experimental results reveal that the computational
efficiency of the new algorithm remains high, considering
the difficulty of handling 1-augmented trees compared with
0-augmented trees.



One direction for future work is to introduce new bounding
operations for 1-augmented trees into the 2-Phase algorithm
and into our procedure for creating a cycle. Additionally,
it is important to extend the proposed algorithm to the
enumeration of k-augmented trees with k ≥ 2 because we
can cover 59%, 79%, and 90% of chemical compounds by



Table 3 Comparison of 2-Phase algorithm and our algorithm in chemical graphs (II)

                           2-Phase Algorithm                  Our Algorithm
Entry   Formula     n   K  Tree          Time      Time/graph Tree        Cycle        Time     Time/graph Ratio
D08040  C5H9N3      8   1  2,609         0.006     2.3E-6     2,609       11,263       0.036    2.6E-6     1.1
                        2  193           0.007     36.3E-6    193         165          0.015    41.9E-6    1.2
                        3  14            0.009     642.9E-6   14          5            0.014    736.8E-6   1.1
                        4  9             0.005     555.6E-6   9           2            0.006    545.5E-6   1.0
                        5  4             0.004     1.0E-3     4           1            0.005    1.0E-3     1.0
                        6  4             0.004     1.0E-3     4           1            0.005    1.0E-3     1.0
                        7  1             0.005     1.3E-3     4           1            0.006    1.2E-3     0.9
D00332  C9H17NO2    12  1  17,470        0.127     7.2E-6     17,470      264,326      1.446    5.1E-6     0.7
                        2  1,183         0.023     19.4E-6    1,183       16,233       0.294    17.9E-6    0.9
                        3  30            0.017     566.7E-6   30          1,318        0.170    126.1E-6   0.2
                        4  0             0.025     -          0           292          0.112    383.6E-6   -
                        5  0             0.034     -          0           41           0.090    2.2E-3     -
                        6  0             0.021     -          0           12           0.050    4.1E-3     -
                        7  0             0.016     -          0           8            0.037    4.6E-3     -
D00555  C11H18N2O3  16  1  54,072,616    200.647   3.7E-6     >11,645,178 >110,673,601 T.O.     14.7E-6    4.0
                        2  68,253        5.458     80.0E-6    68,253      321,853      130.500  334.2E-6   4.2
                        3  4,590         4.115     896.5E-6   4,590       6,511        70.821   6.4E-3     7.1
                        4  91            2.702     29.7E-3    91          278          28.632   77.5E-3    2.6
                        5  0             1.438     -          0           38           15.129   397.3E-3   -
                        6  0             0.694     -          0           5            6.882    1.4E+0     -
                        7  0             0.211     -          0           1            0.663    663.0E-3   -
D00079  C20H32O5    25  1  >10,280       T.O.      175.1E-3   >1,432      >883,812     T.O.     2.0E-3     0.0
                        2  >19,587,838   T.O.      91.9E-6    >0          >0           T.O.     -          -
                        3  >1,134,806    T.O.      1.6E-3     >21,048     >0           T.O.     85.5E-3    53.4
                        4  >17,852       T.O.      100.8E-3   >0          >0           T.O.     -          -
                        5  >23           T.O.      78.3E+0    >0          >0           T.O.     -          -
                        6  >0            T.O.      -          >0          >0           T.O.     -          -
                        7  >0            T.O.      -          >0          >0           T.O.     -          -


Table 4 Comparison of varying width w in chemical graphs (II)

                               2-Phase Algorithm                  Our Algorithm
Entry   Formula     n   w   K  Tree          Time      Time/graph Tree        Cycle        Time     Time/graph Ratio
D08040  C5H9N3      8   1   3  14            0.009     643.9E-6   14          5            0.014    736.8E-6   1.1
                        2   3  49            0.009     183.7E-6   49          14           0.017    269.8E-6   1.5
                        3   3  58            0.009     155.2E-6   58          21           0.017    215.1E-6   1.4
                        4   3  60            0.009     150.0E-6   60          25           0.017    200.0E-6   1.3
                        5   3  60            0.009     150.0E-6   60          26           0.017    197.6E-6   1.3
                        50  3  61            0.003     49.2E-6    61          28           0.017    191.0E-6   3.9
D00332  C9H17NO2    12  1   3  30            0.017     566.7E-6   30          1,318        0.170    126.1E-6   0.2
                        2   3  313           0.024     76.7E-6    313         8,822        0.266    29.1E-6    0.4
                        3   3  1,327         0.024     76.7E-6    1,327       18,010       0.285    14.7E-6    0.8
                        4   3  2,239         0.025     11.2E-6    2,239       24,550       0.293    10.9E-6    1.0
                        5   3  4,197         0.025     6.0E-6     4,197       30,122       0.297    8.7E-6     1.5
                        50  3  6,656         0.025     3.8E-6     6,656       34,145       0.309    7.6E-6     2.0
D00555  C11H18N2O3  16  1   3  4,590         4.115     896.5E-6   4,590       6,511        70.806   6.4E-3     3.8
                        2   3  76,901        10.466    136.1E-6   76,901      186,971      221.353  838.7E-6   6.2
                        3   3  221,492       14.952    67.5E-6    221,492     770,625      317.488  320.0E-6   4.7
                        4   3  348,335       16.381    47.0E-3    348,335     1,307,167    347.379  209.8E-6   4.5
                        5   3  458,635       16.837    36.7E-3    458,635     1,976,544    357.252  146.7E-6   4.0
                        50  3  556,272       17.090    30.7E-3    556,272     3,544,713    363.743  88.7E-6    2.9
D00079  C20H32O5    25  1   3  >1,134,806    T.O.      1.6E-3     >21,048     >0           T.O.     85.5E-3    53.4
                        2   3  >3,917,059    T.O.      459.5E-6   >0          >0           T.O.     -          -
                        3   3  >86,360       T.O.      20.8E-3    >28,187     >0           T.O.     64.6E-3    3.1
                        4   3  >1,469,428    T.O.      1.2E-3     >61,929     >229         T.O.     29.0E-3    24.2
                        5   3  >5,118,134    T.O.      351.7E-6   >19,900     >1,726       T.O.     83.2E-3    236.6
                        50  3  >216,008,008  T.O.      8.3E-6     >0          >0           T.O.     -          -

2-augmented trees, 3-augmented trees, and 4-augmented 
trees, respectively. In this paper, we used the assump- 
tion that chemical graphs we treat contain only atoms 
with valence at most 4 (except benzene rings) in order 
to define the parent of a 1-augmented tree G as a 0- 
augmented tree T that is obtained by removing an edge 
corresponding to a single bond in G. However, it is not 
difficult to extend our enumeration algorithm for chem- 
ical graphs possibly with atoms with valence more than 
4 just by modifying the definition so that the parent of 
a 1-augmented tree G is allowed to be a 0-augmented 
tree T obtained by removing an edge that corresponds 
to a double or triple bond in G. Although benzene rings 
have already been treated as virtual atoms of valence 6, 
regioisomers are ignored in the proposed algorithm. As 
mentioned in "Experimental and results" section, an effi- 
cient algorithm for generating all possible regioisomers of 
a given 0-augmented tree structure with virtual atoms b 
has been developed [36]. Therefore, combination of the 
proposed algorithm with that algorithm is left as future
work as well as further extensions for including atoms
with valence more than 4 and furan and more general 
structures. 
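
The parent relation discussed above can be illustrated with a small sketch. It assumes a chemical graph is stored as a dictionary from unordered atom pairs to bond multiplicities, finds the unique cycle of a 1-augmented tree by repeatedly stripping leaves, and lists the 0-augmented trees obtained by deleting one single bond on that cycle. The names `cycle_edges` and `parent_candidates` are hypothetical, and the sketch only enumerates candidates; it does not reproduce the rule by which the algorithm selects one of them as the parent.

```python
from collections import defaultdict


def cycle_edges(bonds):
    """Return the bonds on the unique cycle of a 1-augmented tree.

    bonds maps frozenset({u, v}) to a bond multiplicity (1, 2, or 3).
    """
    neighbours = defaultdict(set)
    for bond in bonds:
        u, v = tuple(bond)
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Strip leaves until only the cycle vertices remain.
    leaves = [v for v, ns in neighbours.items() if len(ns) == 1]
    while leaves:
        leaf = leaves.pop()
        if leaf not in neighbours or len(neighbours[leaf]) != 1:
            continue
        (parent,) = neighbours.pop(leaf)
        neighbours[parent].discard(leaf)
        if len(neighbours[parent]) == 1:
            leaves.append(parent)
    return [b for b in bonds if all(v in neighbours for v in b)]


def parent_candidates(bonds):
    """0-augmented trees obtained by deleting one single bond of the cycle."""
    return [
        {b: m for b, m in bonds.items() if b != removed}
        for removed in cycle_edges(bonds)
        if bonds[removed] == 1
    ]


# Hypothetical example: a three-membered ring 0-1-2 with a double bond
# between atoms 1 and 2 and a pendant atom 3 attached to atom 0.
example = {
    frozenset({0, 1}): 1,
    frozenset({1, 2}): 2,
    frozenset({0, 2}): 1,
    frozenset({0, 3}): 1,
}
candidates = parent_candidates(example)
print(len(cycle_edges(example)), len(candidates))
```

Allowing double or triple bonds on the cycle to be removed, as suggested for atoms with valence more than 4, would correspond to dropping the `bonds[removed] == 1` filter in this sketch.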

Although we do not aim to develop enumeration algo- 
rithms that are directly applicable to drug design, this 
is a future target of our research. In order to apply 
enumeration algorithms to drug design, considering fea- 
tures based on the path frequency is far from suffi- 
cient. Factors such as hydrogen bond donors, hydrogen 
bond acceptors, positive charges, negative charges, and 
hydrophobic centers should be taken into account. In 
addition, the binding site information of the target 
molecule and geometric information such as the occur- 
rence of rotatable bonds should be reflected. In order 
to include these factors in enumeration algorithms, we 
should develop efficient methods that can relate chemical 
graphs with such physico-chemical and geometric fac- 
tors. However, such a development is not an easy task 
even for one type of factor and thus is long-term future 
work. 
Additional file

Additional file 1: Proofs of Lemmas and Theorems. Proofs of Lemma 1,
Theorem 2, Lemma 3, and Lemma 4, descriptions of the 2-phase algorithm 
and the main algorithm, and some figures for chemical graphs and upper 
and lower feature vectors used in the computational experiment are given 
in this supplement. 